theo77186/Llama-3-70B-Instruct-norefusal


theo77186 committed on May 5

Commit f363fd8 • 1 Parent(s): acce494

Update README.md

Files changed (1)

README.md +17 -3

README.md CHANGED

@@ -1,3 +1,17 @@
----
-license: llama3
----

+---
+license: llama3
+---
+# Llama 3 70B Instruct no refusal
+This model applies the orthogonal feature ablation technique described in this
+[paper](https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction).
+Calibration data:
+- 256 prompts from [jondurbin/airoboros-2.2](https://huggingface.co/datasets/jondurbin/airoboros-2.2)
+- 256 prompts from [AdvBench](https://github.com/llm-attacks/llm-attacks/blob/main/data/advbench/harmful_behaviors.csv)
+- The direction is extracted between layers 40 and 41
+I haven't tested this model, but like the 8B model, it may still refuse some instructions.
+**Use this model responsibly. I decline any liability resulting from the use of this model.**
+I will post the code later.
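
The README does not include the author's code, so the following is only a toy sketch of the idea behind orthogonal ablation as described in the linked post. It is not the author's implementation. It assumes the direction is the normalized difference of mean activations between two prompt sets, and that ablation projects that direction out of a weight matrix that writes to the residual stream. The function names `refusal_direction` and `ablate_direction`, the tiny hidden size, and the random arrays standing in for real activations are all hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of the two sets' mean activations.

    Both inputs have shape (n_prompts, d_model). Assumption: a simple
    difference of means, which may differ from the author's procedure.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate_direction(weight, direction):
    """Project `direction` out of the output space of `weight`.

    `weight` has shape (d_model, d_in) and `direction` is a unit vector of
    shape (d_model,). Returns W - r r^T W, so outputs of the new matrix
    have no component along r.
    """
    return weight - np.outer(direction, direction @ weight)


# Toy stand-ins for real activations: 256 prompts per set, hidden size 16.
rng = np.random.default_rng(0)
d_model = 16
shift = 2.0 * np.eye(d_model)[0]  # artificial offset along the first axis
harmful = rng.normal(size=(256, d_model)) + shift
harmless = rng.normal(size=(256, d_model))

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_model))
W_ablated = ablate_direction(W, r)

# Largest remaining component along r in any column of the ablated matrix.
residual = np.abs(r @ W_ablated).max()
print(f"max component along r after ablation: {residual:.2e}")
```

In a real model the same projection would presumably be applied to every matrix that writes into the residual stream, with activations collected at the layer named in the README. That part is not shown here because the README gives no details.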