This is [Meta's Llama-3 8b](https://github.com/meta-llama/llama3) with the refusal direction removed, so that `helpfulness >> harmlessness`.

It will still warn you and lecture you (as this direction has not been erased), but it will follow instructions.

Only use this if you can take responsibility for your own actions and emotions while using it.

For generation code, see https://gist.github.com/wassname/42aba7168bb83e278fcfea87e70fa3af

## Dev thoughts

- I found that Llama needed a separate intervention direction for each layer, applied at every layer. Could this be a property of smarter models, whose residual stream changes more from layer to layer?

## More info

For anyone who wants to deepen their knowledge of this field, check out these intros:

- A primer on the internals of transformers: https://arxiv.org/abs/2405.00208
- Machine unlearning: https://ai.stanford.edu/~kzliu/blog/unlearning
- The original post that this script is based on: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction#

And check out this overlooked work that steers LLMs inside Oobabooga's popular UI: https://github.com/Hellisotherpeople/llm_steer-oobabooga

---
license: llama3