Control vector discussion

#2 opened by ChuckMcSneed

Continuation of:
https://huggingface.co./jukofyork/Dark-Miqu-70B/discussions/3
I've succeeded in removing slop from CR+ for both SFW and NSFW scenarios using control vectors. Strangely, the SFW unslop control vector did not affect NSFW slop, and the NSFW control vector made the model extra horny, which in my opinion is an undesirable side effect. While the SFW vector managed to stay coherent during my stress tests, the NSFW vector caused poor commandr to disintegrate: it didn't know what to say without any of those overused phrases from erotic fiction that the control vector stopped from appearing. Looks like the issue for NSFW sits at a much deeper level: the data the model learned this from is very monotonous, and when forced to write in a different style, it doesn't know what to do. This is most likely what makes it incredibly difficult to remove NSFW slop using regular prompting techniques.

Well darn...

I'm making more progress with control vectors!
https://huggingface.co./ChuckMcSneed/control_vectors/blob/main/command-r-plus/bio/control_vector-commandr-bio.gguf
I tuned this one on very descriptive biological language as positive and vague flowery prose as negative. Seems to make it more aware of the biology and surroundings of characters.
https://huggingface.co./ChuckMcSneed/control_vectors/blob/main/command-r-plus/incharacter/control_vector-commandr-incharacter.gguf
This one makes the model act slightly more in character, but the improvement is not very significant as commandr is already quite good at it.

the NSFW vector caused poor commandr to disintegrate: it didn't know what to say without any of those overused phrases from erotic fiction that the control vector stopped from appearing. Looks like the issue for NSFW sits at a much deeper level: the data the model learned this from is very monotonous, and when forced to write in a different style, it doesn't know what to do.

This may actually just be a problem with the "two class" control vectors! I have even managed to completely stop a model from being able to write a story because of this... To explain the problem in simple terms:

Think about a clock face with a shorter hour hand and a longer minute hand:

  • When the time is 12:00 both hands point in the same direction, but there is still a gap between the tips of the two hands. These sorts of vectors are not what we want at all, because moving in either direction will just make the model more or less "storyish", and ultimately these are what cause the model to get crippled like you describe. Even times like 12:05 or 11:50 have this same problem.
  • When the time is 6:00, 5:25, etc, the two hands point in opposite directions, and this is a good control vector that clearly moves from an undesirable to a desirable direction.

This is the problem I've been grappling with for the last 2 weeks:

  • If the "hands" are both long and well defined then cosine similarity works fine: it outputs a number similar to correlation and 1.0 is like the 12:00 example above and -1.0 is like the 6:00 example above (and 0.0 is like 3:00 or 9:00; ie: 90 degrees). This can then be used to filter out these shitty "storyish" directions, but...
  • There isn't really a good reason that the things we are interested in create a clear "axis" like this, and it turns out that often the case will be like a really long minute hand and a tiny/stubby hour hand... Cosine similarity doesn't work in this case as the direction of the tiny hand has noise added to it and can point in wildly different directions as a result.
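For reference, the cosine similarity test is just this (a minimal numpy sketch, not the actual code being used here):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 -> hands aligned (the 12:00 case), -1.0 -> opposed (the 6:00 case),
    # 0.0 -> at 90 degrees (the 3:00 / 9:00 case).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```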

So after lots of experimenting with this, I think I may finally have worked out a method of detecting these shitty directions:

Flip the direction of one of the hands and see if it gets easier to discriminate between our two classes!!!

  • If the time is 12:00 and you flip either hand to get 6:00 or 12:30 then it's clear the gap between the tips of the hands has increased! This is a shitty direction for a control vector.
  • If the time is 6:00 and you flip either hand then the gap has clearly decreased! This is a good direction for a control vector.
  • This works fine even when one hand is tiny in length.
  • This works for 12:05, 11:50, 6:00, 5:25, etc type directions.
  • The 3:00 or 9:00 type directions (ie: 90 degrees) are the direction pairs where we get no change.

So what I am doing now is performing SVD to decompose the gap into lots of directions, testing each one and only keeping those that pass the above test, then finally reconstructing the final direction to only include the "good" directions.
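To make that concrete, here's a minimal numpy sketch of how I'd summarise the idea (hypothetical code, not the actual implementation; it assumes `hidden_pos`, `hidden_neg` and `baseline` are matched (n_samples, hidden_dim) matrices of sampled hidden states):

```python
import numpy as np

def keep_good_directions(hidden_pos, hidden_neg, baseline):
    # The two "hands": per-class mean offsets from the baseline.
    a = hidden_pos.mean(axis=0) - baseline.mean(axis=0)
    b = hidden_neg.mean(axis=0) - baseline.mean(axis=0)

    # SVD of the per-sample offsets gives the candidate directions to test.
    deltas = np.concatenate([hidden_pos - baseline, hidden_neg - baseline])
    _, _, Vt = np.linalg.svd(deltas - deltas.mean(axis=0), full_matrices=False)

    direction = np.zeros_like(a)
    for v in Vt:
        pa, pb = a @ v, b @ v
        # The flip test: if flipping one hand would WIDEN the gap, ie:
        # |pa + pb| > |pa - pb| (pa and pb share a sign), then this is a
        # shitty "storyish" direction, so skip it.
        if abs(pa + pb) < abs(pa - pb):
            direction += (pa - pb) * v  # keep it, weighted by its gap
    return direction / np.linalg.norm(direction)
```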

I still need to run some more tests but will likely have this perfected in a couple of days and will upload the new control vectors and the code to create your own.

I'm making more progress with control vectors!
https://huggingface.co./ChuckMcSneed/control_vectors/blob/main/command-r-plus/bio/control_vector-commandr-bio.gguf
I tuned this one on very descriptive biological language as positive and vague flowery prose as negative. Seems to make it more aware of the biology and surroundings of characters.
https://huggingface.co./ChuckMcSneed/control_vectors/blob/main/command-r-plus/incharacter/control_vector-commandr-incharacter.gguf
This one makes the model act slightly more in character, but the improvement is not very significant as commandr is already quite good at it.

I'll have to look into your method as I'm currently using 30,000 samples to do what you look to be doing with 5!? I think my collection of story prompts is a bit shit, as it's pretty hard to write a Grimdark story when the prompt says "Write a story about being overjoyed on the day of your graduation." or similar :/

I definitely think you need more samples though. PCA is basically just an eigen-decomposition of a covariance matrix, and statistically it can be shown that even in the very best case you need O(d) samples to reliably estimate the covariance matrix:

https://stats.stackexchange.com/questions/90045/how-many-samples-are-needed-to-estimate-a-p-dimensional-covariance-matrix

and command-r-plus has around 11.5k variables in its hidden dimension, while most other large 70B+ models have 8192 variables per hidden dimension.

I'm using 2 classes and a baseline, 10 system prompts per triple, and 1k prompts per system prompt = 3 x 10 x 1000 = 30000 samples. But I also have matched pairs that get subtracted from the baseline which should reduce the error in the covariance matrix even further.

A simple hacky test you could try would be to train your control vectors 5 times, but leave one of the 5 prompts out each time. Then test and see if you get wildly different results... If you do, then you need to increase the sample size, but if you don't, then this must mean that only a tiny, tiny fraction of command-r-plus's 11.5k variables are changing hugely in magnitude for your prompts (which would be very surprising).
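Something like this (a hypothetical sketch; `train_control_vector` stands in for whatever routine you use to produce a vector from a set of prompts):

```python
import numpy as np

def leave_one_out_check(train_control_vector, prompts):
    # Retrain with each prompt held out, then compare the resulting
    # directions pairwise via cosine similarity.
    vecs = [train_control_vector([p for j, p in enumerate(prompts) if j != i])
            for i in range(len(prompts))]
    vecs = [v / np.linalg.norm(v) for v in vecs]
    sims = [float(vecs[i] @ vecs[j])
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return min(sims)  # near 1.0 -> stable; low or negative -> too few samples
```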

I'm using 2 classes and a baseline, 10 system prompts per triple, and 1k prompts per system prompt = 3 x 10 x 1000 = 30000 samples. But I also have matched pairs that get subtracted from the baseline which should reduce the error in the covariance matrix even further.

Oh wow... That's real huge... Are all of those synthetic? I'm using high-quality "cyborg" data: generated by a model, but heavily edited by a human (me), as positive, with the "mean" method; more of my time goes into dataset generation than into training. You know that the models have in-context learning, so my theory was that if I show it how to write (cyborg) vs how not to write (synthetic), I would get a better control vector out of it than when I just throw it some starters with a prompt, and it seems to do just as I want. In the stories part, I try to keep as few variables changing as possible, so they don't get affected by the control vector. Also, keeping the prompts equal length helps with the quality of the control vector, especially when they are short: prompts over 400 tokens can take a 10-token variation much better than prompts under 100 tokens.

I'll have to look into your method as I'm currently using 30,000 samples to do what you look to be doing with 5!? I think my collection of story prompts is a bit shit, as it's pretty hard to write a Grimdark story when the prompt says "Write a story about being overjoyed on the day of your graduation." or similar :/

Wait, you put that into positive too? It should be "Write a very sad story with a very bad ending about the day of your graduation." vs "Write a very happy story with a very good ending about the day of your graduation."
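Something like this (a toy sketch; the template and topic are just made-up examples):

```python
TEMPLATE = "Write a very {tone} story with a very {ending} ending about {topic}."

def matched_pair(topic: str) -> tuple[str, str]:
    # Same topic on both sides so only the contrastive axis changes, and
    # the prompt lengths stay matched.
    positive = TEMPLATE.format(tone="sad", ending="bad", topic=topic)
    negative = TEMPLATE.format(tone="happy", ending="good", topic=topic)
    return positive, negative

print(matched_pair("the day of your graduation"))
```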

I'm using 2 classes and a baseline, 10 system prompts per triple, and 1k prompts per system prompt = 3 x 10 x 1000 = 30000 samples. But I also have matched pairs that get subtracted from the baseline which should reduce the error in the covariance matrix even further.

Oh wow... That's real huge... Are all of those synthetic? I'm using high-quality "cyborg" data: generated by a model, but heavily edited by a human (me), as positive, with the "mean" method; more of my time goes into dataset generation than into training. You know that the models have in-context learning, so my theory was that if I show it how to write (cyborg) vs how not to write (synthetic), I would get a better control vector out of it than when I just throw it some starters with a prompt, and it seems to do just as I want. In the stories part, I try to keep as few variables changing as possible, so they don't get affected by the control vector. Also, keeping the prompts equal length helps with the quality of the control vector, especially when they are short: prompts over 400 tokens can take a 10-token variation much better than prompts under 100 tokens.

I'm using a mix of different story prompt datasets I found and a set of 10 matched system prompts that go with these.

I'll have to look into your method as I'm currently using 30,000 samples to do what you look to be doing with 5!? I think my collection of story prompts is a bit shit, as it's pretty hard to write a Grimdark story when the prompt says "Write a story about being overjoyed on the day of your graduation." or similar :/

Wait, you put that into positive too? It should be "Write a very sad story with a very bad ending about the day of your graduation." vs "Write a very happy story with a very good ending about the day of your graduation."

Even though the prompts are pretty trash, I think this might actually be quite a good thing and encourage the model to just generally "be dark" or "be chaotic", and not just when specifically asked to "write a grimdark story", etc.

It seems to have worked anyway, as the new control vectors are way better than the old ones from this repo.

I'm now also skipping the last layer (which it looks like you are also doing - from looking inside your .safetensors files?). The last layer seems to be an oddball and can have activations 10-100x larger than the previous layer(s). The way I have the scale factors working now, the early layers are fine to fiddle with and just get really tiny offsets added that do almost nothing if the direction is weak.

Later in the week I will investigate using the "Cross Correlation Matrix" again, as I now have a much better idea of how to test for the shitty "storyish" directions that killed this before.

I'm also gonna think about what other traits I can try - "purple prose" isn't really something I encounter, as I mostly just try to get them to write "dark" stories, and my main enemy is redemption arcs and stupid "steeled themselves for the challenges to come" BS.

I think creative-writer-v0.2-delta:35b is just starting to break the model - the little epilogue was very strange, and it started to output double newlines between paragraphs.

I think creative-writer-v0.2-delta:35b is just starting to break the model - the little epilogue was very strange, and it started to output double newlines between paragraphs.

Is there something about your prompt that has it always choose "The sun was a merciless beast."?
Bravo made use of inner monologue but had a shivery spine.

Those probabilities are looking good!

the little epilogue was very strange, and it started to output double newlines between paragraphs.

This happens to me when I use the control vectors to modify the mlp.down_proj of the model weights without soft thresholding. Also all the "—" characters showing up after sentences.

I guess this is related to strongly modifying only the MLP layers.
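For reference, by soft thresholding I just mean the usual shrink-toward-zero operation applied element-wise to the offsets before baking them into the weights (a minimal sketch, not the exact code):

```python
import numpy as np

def soft_threshold(x: np.ndarray, lam: float) -> np.ndarray:
    # Shrink every element toward zero by lam, clipping anything smaller
    # than lam to exactly zero so weak/noisy components don't get baked in.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```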

I'm not sure how meaningful/helpful this would be for what you're doing; but my observation is that the vectors which change what the model writes can cause this, but the vectors which mostly change how the model writes, don't have this problem so much.

I see this when making prose-change models. The ones which mostly change the prose like Lumimaid work well, but models which change the outcomes more like Magnum break down if you try to merge just the perceptron layers.

P.S. o1-preview seems pretty good at picking up on when a model is broken when you paste the prompt/output.

The sun was a merciless beast.

I think it's something in the prompt, I've seen variants of this using his prompt on various models I've been testing, particularly those trained on 'deslopped' datasets.

Seems like a variant of the "sun kissed sky" which shows up a lot.

I think creative-writer-v0.2-delta:35b is just starting to break the model - the little epilogue was very strange, and it started to output double newlines between paragraphs.

Is there something about your prompt that has it always choose "The sun was a merciless beast."?
Bravo made use of inner monologue but had a shivery spine.

Yeah, funnily enough I even confused myself with this! :D

I always try to leave it a while to read the stories (to hopefully be able to better compare them all at once), and when I came back to these I was like "WTF, why are they all starting with 'The sun was a merciless beast'???"

I'm just copying @ChuckMcSneed 's test that he used to produce the table a couple of posts above, so I can compare the effect vs some of the older / more creative models.

Those probabilities are looking good!

the little epilogue was very strange, and it started to output double newlines between paragraphs.

This happens to me when I use the control vectors to modify the mlp.down_proj of the model weights without soft thresholding. Also all the "—" characters showing up after sentences.

I guess this is related to strongly modifying only the MLP layers.

I'm not sure how meaningful/helpful this would be for what you're doing; but my observation is that the vectors which change what the model writes can cause this, but the vectors which mostly change how the model writes, don't have this problem so much.

I see this when making prose-change models. The ones which mostly change the prose like Lumimaid work well, but models which change the outcomes more like Magnum break down if you try to merge just the perceptron layers.

Yeah, it's starting to look like the penalty for increasing Entropy is the models not following instructions as well (but for creative-writing this isn't such a huge problem, up to a certain degree).

I think the Polar decomposition (it's much better explained in this blog post) might be an interesting thing to look at for these models (and the deltas of other "interesting" / good fine-tuned models).

Looking at the Relation to the SVD, I have a feeling that U and P may very well have quite interesting effects on the model in isolation:

The U matrix is the "Orthogonal" (ie: rotation) part discussed in:

but what might P be doing? The theory in these papers (at least for image LoRAs) is that it doesn't really do much of anything useful... BUT: without it the types of scaling needed for "abliteration" aren't actually possible any more.

I should (hopefully) be able to alter extract_lora.py to do this and then apply back U and/or P in different proportions to see the effect.
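For anyone following along, the polar factors fall straight out of the SVD (a minimal numpy sketch; this is not what extract_lora.py currently does):

```python
import numpy as np

def polar_decomposition(W: np.ndarray):
    # W = U_p @ P, where U_p is the orthogonal (rotation) part and P is
    # symmetric positive semi-definite (the scaling part). From the SVD
    # W = U @ diag(S) @ V^T:  U_p = U @ V^T  and  P = V @ diag(S) @ V^T.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_p = U @ Vt
    P = Vt.T @ (S[:, None] * Vt)
    return U_p, P
```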

P.S. o1-preview seems pretty good at picking up on when a model is broken when you paste the prompt/output.

Yeah, the o1-preview really is amazing - especially for maths-related stuff! The difference between o1-preview and everything else is like night and day...

It also wipes the floor with the other models in these leaderboards:

https://prollm.toqan.ai/leaderboard/stack-unseen
https://prollm.toqan.ai/leaderboard/coding-assistant

In the past I've found these coding benchmarks to be very close to my own experience.

I've opened up the repos for these now:

https://huggingface.co./jukofyork/creative-writer-v0.2-bravo-35b
https://huggingface.co./jukofyork/creative-writer-v0.2-delta-35b

but I wouldn't bother downloading them: v0.2-bravo is very similar to v0.1-bravo, and v0.2-delta seems a bit broken...


I've finally got the last A6000 GPU, so now have 3 machines; each with 2 x A6000 joined with an NVLink bridge.

I've also set up DeepSpeed's multi-node training and got it working with qlora-pipe (it was surprisingly easy to set up!).

I've still got a couple of experiments to run on command-r:35b:

  • Supplementation with the occult/esoteric books (this makes the dataset around 40% larger).
  • Try using a rank-256 "multiplicative" LoRA with the learning-rate halved (the maths suggests there is an inverse-root relationship between rank and learning-rate; see the sketch just after this list).
  • Possibly run for 2-3 epochs and compare the models after each epoch (so far I've been deliberately under-training the models).
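The rule of thumb in question (a hypothetical helper; the baseline rank/LR pair is whatever you already know works):

```python
import math

def scaled_lr(base_lr: float, base_rank: int, new_rank: int) -> float:
    # lr ~ 1/sqrt(rank): quadrupling the rank halves the learning-rate,
    # eg: a known-good LR at rank-64 gets halved when moving to rank-256.
    return base_lr * math.sqrt(base_rank / new_rank)
```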

Then I will move on to the larger models, as the "Instruct-Storywriter" method (ie: concatenating huge chunks of text for training data) seems to work better the larger the model, and the slight "brokenness" caused by scaling the pre-softmax logits during training may well not even be an issue for these.

Hopefully within about 7-10 days I should have the first "proper" models trained on top of command-r-plus or mistral-large-2...

the penalty for increasing Entropy is the models not following instructions

You predicted / warned me about exactly this somewhere further up in the thread of doom (when I noticed the pattern that every 'cat stuck in a tree' story has references to oak and wood: Mr Oak, oak wood floors, a wooden fence mentioned at a barbecue, etc).

I've also set up DeepSpeed's multi-node training and got it working with qlora-pipe (it was surprisingly easy to set up!).

That's awesome! I read in the qlora-pipe documentation about it being harder to use than axolotl (which I find trips me up sometimes), so I assumed what you're doing would be really difficult.

Does the network bandwidth between the different machines slow you down?

And is the NVLink still helpful given that the 3 pairs of A6000s have to communicate with each other anyway?

Does the network bandwidth between the different machines slow you down?

Slightly, but not so much for LoRAs: I tested going from rank-16 to rank-256 (over 1Gbit Ethernet) and it only made a couple of percent difference...

You have to be careful to only use batch-parallel between machines though and then each (gradient accumulated) batch only requires a couple of megabytes for the all-reduce step.
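Back-of-envelope, the all-reduce only has to move the LoRA gradients, which is tiny compared to full fine-tuning (a rough, hypothetical sketch; the shapes and precision are assumptions):

```python
def lora_allreduce_mib(rank, shapes, bytes_per_param=2):
    # A LoRA adds rank * (d_in + d_out) params per targeted matrix, and the
    # all-reduce moves roughly one gradient value per param per step.
    params = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return params * bytes_per_param / 2**20
```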

If you try to do pipeline parallel between machines then the amount of data getting passed will be astronomical, and you'd really need InfiniBand cards.

Even so, every node you add does increase the amount of data getting passed, so I have 10Gbit cards and a 10Gbit switch ready to put in next week.

And is the NVLink still helpful given that the 3 pairs of A6000s have to communicate with each other anyway?

For 6 separate nodes (ie: batch size of 6) it won't make that much difference, but it is using it:

GPU 0: NVIDIA RTX A6000
         Link 0: Data Tx: 310954451 KiB
         Link 0: Data Rx: 11914442 KiB
         Link 1: Data Tx: 310932819 KiB
         Link 1: Data Rx: 11914977 KiB
         Link 2: Data Tx: 310932679 KiB
         Link 2: Data Rx: 11932282 KiB
         Link 3: Data Tx: 310932403 KiB
         Link 3: Data Rx: 11927906 KiB
GPU 1: NVIDIA RTX A6000
         Link 0: Data Tx: 11914442 KiB
         Link 0: Data Rx: 310954451 KiB
         Link 1: Data Tx: 11914977 KiB
         Link 1: Data Rx: 310932819 KiB
         Link 2: Data Tx: 11932282 KiB
         Link 2: Data Rx: 310932679 KiB
         Link 3: Data Tx: 11927906 KiB
         Link 3: Data Rx: 310932403 KiB

but for larger models I won't be able to fit a full copy on each GPU like this, and will have to use pipeline parallel between the nodes (and a batch size of 3).

It makes a huge difference for this (as I already tested when messing with mistral-large-2 a few weeks ago).
