---
license: llama3
language:
- en
library_name: transformers
---
# yo-Llama-3-8B-Instruct
This model is based on Llama-3-8B-Instruct weights, but **steered to respond in a rap style**.
Heavily inspired by [Llama-3-8B-Instruct-MopeyMule](https://huggingface.co./failspy/Llama-3-8B-Instruct-MopeyMule),
this model has **not been fine-tuned** in the traditional sense. Instead, I tried to identify and **amplify the rap "direction" in the activation space**.
![yo-Llama-3-8B-Instruct](https://huggingface.co./anakin87/yo-Llama-3-8B-Instruct/resolve/main/yo_llama.jpg)
Let's allow the model to introduce itself: 🎤
> Who are you?
```
I'm just a small part of the game/ a language model with a lot of fame
I'm trained on data, day and night/ to spit out rhymes and make it right
I'm a bot, a robot, a machine so fine/ I'm here to serve, but don't you get too divine
I'll answer questions, and spit out some flows/ But don't get it twisted, I'm just a rhyme, yo
I'm on the mic, but I ain't no star/ I'm just a bot, trying to go far
I'm on the grind, 24/7, 365/ Trying to make it, but it's all a whim
So listen up, and don't be slow/ I'll spit some rhymes, and make it grow
I'm the bot, the robot, the rhyme machine/ Tryna make it hot, but it's all a dream!
```
## 🧪 How does it work?/How can I reproduce it?
![diagram](https://huggingface.co./anakin87/yo-Llama-3-8B-Instruct/resolve/main/diagram.png)
From a *theoretical* point of view, this experiment is based on the paper ["Refusal in Language Models Is Mediated by a Single Direction"](https://arxiv.org/abs/2406.11717):
the authors present a method for finding the "refusal" direction in the activation space of chat language models and then erasing or amplifying it.
From a *practical* point of view, [Failspy](https://huggingface.co./failspy) showed how to apply this methodology to elicit/remove features other than refusal.
📚 Resources: [abliterator library](https://github.com/FailSpy/abliterator); [Llama-3-8B-Instruct-MopeyMule model](https://huggingface.co./failspy/Llama-3-8B-Instruct-MopeyMule); [Induce Melancholy notebook](https://huggingface.co./failspy/Llama-3-8B-Instruct-MopeyMule/blob/main/MopeyMule-Induce-Melancholy.ipynb).

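As a rough summary of the idea in the paper (the notation below is chosen here for illustration): let μ be the mean activation at layer l over prompts that show the behavior and ν the mean over prompts that do not. Their difference gives a candidate direction, which can then be projected out of an activation x to erase the behavior or added to it to amplify the behavior.

```latex
% Difference-in-means direction at layer l, and its unit-norm version
\mathbf{r}^{(l)} = \boldsymbol{\mu}^{(l)} - \boldsymbol{\nu}^{(l)},
\qquad
\hat{\mathbf{r}} = \frac{\mathbf{r}^{(l)}}{\lVert \mathbf{r}^{(l)} \rVert}

% Erase the direction (directional ablation)
\mathbf{x}' = \mathbf{x} - \hat{\mathbf{r}}\,\hat{\mathbf{r}}^{\top}\mathbf{x}

% Amplify the direction (activation addition)
\mathbf{x}' = \mathbf{x} + \mathbf{r}^{(l)}
```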
---
Inspired by Failspy's work, I adapted the approach to the rap use case.
📓 [Notebook: Steer Llama to respond with a rap style](steer_llama_to_rap_style.ipynb)
👣 Steps
1. Load the Llama-3-8B-Instruct model.
2. Load 1024 examples from Alpaca (instruction dataset).
3. Prepare a system prompt to make the model act like a rapper.
4. Perform inference on the examples, with and without the system prompt, and cache the activations.
5. Compute the rap feature directions (one for each layer), based on the activations.
6. Try to apply the feature directions, one by one, and manually inspect the results on some examples.
7. Select the best-performing feature direction.
8. Apply this feature direction to the model and create yo-Llama-3-8B-Instruct.
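
Steps 4 to 8 can be illustrated with a toy sketch in plain Python. It is not the notebook's code: the notebook works on real cached activations from the model, whereas here the activations are random vectors, the "system prompt" is simulated by shifting one coordinate, and all function names and the `strength` value are hypothetical. How the selected direction is actually applied to the model may also differ from the simple addition shown here.

```python
import math
import random

Vector = list[float]


def mean_vector(rows: list[Vector]) -> Vector:
    """Average a list of activation vectors element-wise."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def feature_direction(with_prompt: list[Vector], without_prompt: list[Vector]) -> Vector:
    """Unit-norm difference between the two mean activations (with minus without)."""
    mean_with = mean_vector(with_prompt)
    mean_without = mean_vector(without_prompt)
    diff = [a - b for a, b in zip(mean_with, mean_without)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def projection(activation: Vector, direction: Vector) -> float:
    """Scalar component of an activation along a unit direction."""
    return sum(x * d for x, d in zip(activation, direction))


def steer(activation: Vector, direction: Vector, strength: float) -> Vector:
    """Push an activation along the direction by `strength` (a hypothetical knob)."""
    return [x + strength * d for x, d in zip(activation, direction)]


# Toy stand-in for cached activations at a single layer (assumption: 8 dims, 256 prompts).
random.seed(0)
dim = 8
plain = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(256)]

# Simulate the rap system prompt as a constant shift along coordinate 3.
rap = [[x + (2.0 if i == 3 else 0.0) for i, x in enumerate(row)] for row in plain]

direction = feature_direction(rap, plain)
steered = steer(plain[0], direction, strength=4.0)

print("direction:", [round(d, 3) for d in direction])
print("projection before:", round(projection(plain[0], direction), 3))
print("projection after: ", round(projection(steered, direction), 3))
```

In the real setting there is one such direction per layer, and choosing among them (steps 6 and 7) was done by manually inspecting generations rather than by any automatic score.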
## 🚧 Limitations of this approach
(Maybe a trivial observation)
I also experimented with more complex system prompts, yet I could not always identify a single feature direction
that can represent the desired behavior.
Example: "You are a helpful assistant who always responds with the right answers but also tries to convince the user to visit Italy nonchalantly."
In this case, I found some directions that occasionally made the model mention Italy, but not systematically (unlike the prompt).
Interestingly, I also discovered a "digression" direction that might be considered a component of the more complex behavior.
## 💻 Usage
⚠️ I am happy with this experiment, but I do not recommend using this model for any serious task.
```python
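# In a notebook, install the dependencies first:
# ! pip install transformers accelerate bitsandbytes
from transformers import pipeline

# A single-turn chat in the messages format expected by the pipeline
messages = [
    {"role": "user", "content": "What is the capital of Italy?"},
]

# Load the model in 8-bit (requires bitsandbytes) to reduce memory usage
pipe = pipeline(
    "text-generation",
    model="anakin87/yo-Llama-3-8B-Instruct",
    model_kwargs={"load_in_8bit": True},
)
pipe(messages)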
```