Have you considered revisiting another pure Vicuna merge using nontoxic-bagel?

#1
by ParasiticRogue - opened

I've found that most of the time models which have a similar template do better when merging just from private testing, and your original Capybara-Tess model was one of the better ones, even compared to your newest merges like v5. Now that there's another decent model using that format and context I was wondering how a 3-way merge with Tess-1.4, Nous-Capybara, and Nontoxic-Bagel might turn out given the synergy, with the nontoxic variant seeming to be the best bagel flavor atm given its ranking position and some testing on my end. I'd do it myself but I'm not sure what the best approach is given my inexperience, so I wanted to ask if you have any interest before I start bashing my head trying to get it right, lol.

Yeah I have considered this, and ran into issues with <im_end> appearing in Vicuna prompt in a testing v6 merge.

One major issue is that DPO-bagel completely falls apart after 8K-16K, and the goal of my merges is to stay coherent at high context. The SFT version is less severe but still problematic. So it needs other 200K models to "extend" its context length if it's included.

Another issue is that Bagel has mixed prompts anyway...

I may do what you suggested though, and if I do make a bigger merge, Capybara, Tess and Pallas will all have higher weights to try and emphasize the Vicuna prompt. I will include a little Tess 1.0 as well since some people tell me they prefer the original Tess 1.0-Capybara merge.

One solution I've seen people do to try and normalize prompts when doing bigger merges if one of the models has a different prompt is to use SLERP with one that has the correct template first, and then using DARE/TIES afterwards for the bigger merge. Considering Pallas takes after Tess already, and Bagel has Capybara in the data, perhaps something like this would be optimal to avoid too much samey overlap?

Nontoxic-Bagel-V0.2 + Pallas-v4 using SLERP.
Pallas-Bagel + Tess-1.4 + Nous-Capybara combined at 33% each using DARE/TIES.
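For anyone unfamiliar with what the SLERP step does, here is a minimal sketch on toy vectors. It is not mergekit's implementation; the two-element lists are hypothetical stand-ins for a single flattened weight tensor from each model, and the function assumes neither vector is all zeros.

```python
import math

def slerp(t, a, b, eps=1e-8):
    """Spherically interpolate between two flat, non-zero weight vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Cosine of the angle between the two vectors, clamped for acos.
    dot = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    dot = max(-1.0, min(1.0, dot))
    theta = math.acos(dot)
    # Nearly parallel (or anti-parallel) vectors: fall back to linear interpolation.
    if math.sin(theta) < eps:
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * theta) / math.sin(theta)
    wb = math.sin(t * theta) / math.sin(theta)
    return [wa * x + wb * y for x, y in zip(a, b)]

# Hypothetical toy tensors standing in for Pallas and Bagel weights.
pallas = [1.0, 0.0]
bagel = [0.0, 1.0]
mid = slerp(0.5, pallas, bagel)  # halfway along the arc between the two
```

At t = 0 the result is the first model and at t = 1 the second, so t controls how far the blend leans toward the model with the "wrong" template.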

This is just speculation on my part given some of the other merges I've seen at the 7b range, so it's entirely possible I'm off base.

That's an interesting idea.

Capybara + Bagel + SUS-Hermes + Tess 1.0 SLERP. Maybe some of the other non Vicuna models in there as well at low weight?

Then Pallas + Tess 1.4 + very low density bagel + low weight capybara (again) + merge?

One other factor is that the low context models like bagel should ideally be "culled" with a very low DARE density to preserve the 200K context, at least according to some unscientific testing of mine. SLERP unfortunately doesn't do this, but task arithmetic can.
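To make the "culling" idea concrete, here is a rough sketch of DARE-style drop-and-rescale applied to a task vector (finetuned weights minus base weights). It is an illustration under simplifying assumptions, not mergekit's code: the lists are hypothetical toy tensors, and the function names are made up for this example.

```python
import random

def dare_delta(delta, density, rng):
    """Keep roughly a `density` fraction of delta entries and rescale the survivors."""
    kept = []
    for d in delta:
        if rng.random() < density:
            kept.append(d / density)  # rescale so the expected value is unchanged
        else:
            kept.append(0.0)          # dropped entry: the base weight stays as-is
    return kept

def apply_task_vector(base, finetuned, density, weight, rng):
    """Task arithmetic: add a sparsified, weighted delta back onto the base."""
    delta = [f - b for f, b in zip(finetuned, base)]
    sparse = dare_delta(delta, density, rng)
    return [b + weight * s for b, s in zip(base, sparse)]

# Hypothetical toy tensors: a long-context base and a short-context finetune.
base = [0.0] * 1000
finetuned = [1.0] * 1000
merged = apply_task_vector(base, finetuned, density=0.1, weight=1.0,
                           rng=random.Random(0))
```

With a density of 0.1, roughly nine in ten parameters stay exactly at the base value in this sketch, which is the intuition behind using a very low density for the short-context models. SLERP has no equivalent knob because it blends every parameter.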

Possibly, given what I've heard and talked to with others before. But for the first merge to optimize prompting I wouldn't cast too wide of a net with different prompts simply because it might dilute it.

SUS-Hermes is probably fine since the original SUS-Chat model is some weird hybrid of Alpaca and Vicuna, so it would probably understand the Vicuna prompt to an extent? But idk if I'd add anything else like AEZAKMI or PlatYi to the mix personally. Plus most of AEZAKMI's data is sourced from Spicy/Airoboros which is already contained in Capybara and Bagel to an extent.

Other than that I'd agree with the mix and slerp you proposed. Maybe swap the Tess versions around, with the Tess-1.0 at a slightly smaller mix compared to Pallas, so the final mix has a better diversity considering Pallas uses 1.4. So something at around this for the final mix below would be my best guess.

Tess-1.0 at 20%
Pallas (Either version 3 or 4 just from the rankings) at 30%
Capybara at 30%
Bagel-SLERP at 20%

Course those are just my estimates. Raise or lower depending on what you find best, but it's probably best to have the original Capybara have a slightly higher digit once it's all said and done just because it still seems to be the best atm. BTW I've seen you on the regular and DPO variants of Bagel, but what do you think of the third Nontoxic version and have you tried it?

https://huggingface.co./jondurbin/nontoxic-bagel-34b-v0.2

As an aside you could also consider making 2 separate SLERPS using one of the Tess variants and/or Capybara to possibly strengthen both Bagel and SUS-Hermes' individual strengths for the final merge. As an example:

SUS-Hermes + Capybara
Bagel + Tess-v1
SLERP-Bagel + SLERP-Hermes + Pallas-v4 + Capybara

I haven't yet. TBH I haven't really touched DPO anymore once I found out the context length was so short, but if nontoxic bagel is a better finetune that could be interesting...

Also, what do you think about gradients? EG a non Vicuna model could have a gradient like:

weight: [0, 0.2, 0.2, 0.2, 0.2, 0.2, 0]

So that it would have no influence at the input and output. Does that even make any sense? Would it help suppress the prompt syntax?
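As a rough illustration of what such a gradient would mean per layer, here is a sketch that linearly interpolates the anchor values across the layer stack. The piecewise-linear scheme and the 60-layer count (for a Yi-34B-sized model) are assumptions for this example; mergekit's exact interpolation may differ.

```python
def expand_gradient(anchors, num_layers):
    """Linearly interpolate a short list of anchor values across num_layers layers."""
    if num_layers == 1:
        return [anchors[0]]
    out = []
    for layer in range(num_layers):
        # Position of this layer on the anchor axis, from 0 to len(anchors) - 1.
        pos = layer * (len(anchors) - 1) / (num_layers - 1)
        lo = int(pos)
        hi = min(lo + 1, len(anchors) - 1)
        frac = pos - lo
        out.append((1 - frac) * anchors[lo] + frac * anchors[hi])
    return out

# The gradient from the post, spread over an assumed 60 layers.
weights = expand_gradient([0, 0.2, 0.2, 0.2, 0.2, 0.2, 0], num_layers=60)
```

In this sketch only the very first and very last layers get exactly zero weight; the layers next to them ramp up toward 0.2, so the non-Vicuna model would still have some influence close to the input and output.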

Also, another idea bouncing around my head is using a YI 4K as a base in a ties model to "extend" the 4K models like SUS. For instance:

Yi 4K base -> Capybara high density + SUS low density

Yi 200K is technically a finetune of Yi 4k I believe, so this would more directly merge the 200K training + prompt syntax into other models.

I don't recall seeing any merges using that specific method, only the standard 0.0>1.0/1.0>0.0 method like from here:

https://github.com/Gryphe/BlockMerge_Gradient

The only thing I can say is do a test run to see if it works or not, maybe using a smaller 7b model and comparing with another established one like OpenHermes-2.5-neural-chat-v3-3-Slerp in the same mold. I could probably make a Hermes-Chat version later this week to compare formats and overall chatting prowess if you don't want to.

As for the model base idea...I'm not sure it would make too much of a difference. Context training seems to be tied to the individual models themselves, rather than the base, since I've tested someone doing something similar with Mistral, with a guy taking a couple models from v1 over to v2 and the context only worked up to 8k instead of 32k without rope scaling. No idea if the reverse works though.
To get another idea from the Mixtral frankenmerges, it seems people chose bases that would better fit the prompting so that the model would understand better what is going on underneath the hood. It's why Bagel 7b was picked at first as the base for the 8 separate MoEs, since you could throw different prompt style models in and the base would "sorta" understand them all to a degree as long as the base was trained on it. Course a standard format works better than throwing a whole bunch of different clowns into the clowncar, as I've discovered when trying to do so on my own and gave up. No idea if this translates to regular models though, or if choosing a fine-tune with similar prompting as a base would be better for it. idk.
It's something that might be worth testing on a smaller model if nothing else, but raising context is something I've been banging my head against when doing research and testing, and it's not something I've seen to work fully aside from merging a longer model as a band-aid. Even those so called 16k Mistral extension models start to putter out before hitting 12k.
So best bet from my inexperienced advice for your merge atm, without further testing ofc, is keep it simple and by the books in continuing to do what you know best with the model's parameters.

Also just to revisit this, I am thinking about doing a Vicuna centric merge with SLERP leftovers just as you said, but it's on the backburner atm. I may not get to it this weekend, so you should totally investigate doing it yourself if you wish. I can even post the mergekit configs to do it.

I might do that. Perhaps I'll focus on doing a ChatML 34B version later using Dolphin as the main to compare later with yours, so I'd be happy to have some configs as a guideline to look at. How much V/RAM do 34B merges take up? I have 24gb of VRAM and 32gb of regular ram atm. Do I have to set up an environment with swap/virtual memory first with a decent chunk set, or do you typically use a secondary service like runpod for bigger models?

I have the exact same setup, 24GB/32GB. I do it all locally.

You can merge 4 34B models (+ the base model) in 24GB VRAM + 32GB CPU RAM. Maybe 5? You still need swap space, but just a little. It's pretty fast!

Past that, you need to do a cpu only merge and set up a lot of swap on an ssd. Probably at least 64GB. This merge is slower but still gets done in a reasonable amount of time.

Alright then. Would I still need to muck about with swap if it's only 3 models (+base) for the merge? I'm thinking of just sticking to Bagel+Dolphin+Hermes for now. Maybe using AEZAKMI to slerp with Bagel first for hopefully better prompting on the ChatML format, idk. I'll see how the first merge works first. Thanks for the insight thus far!

Quick update on the HermesChat 7b test using similar gradient methods you suggested; the model seems to work after some brief testing, but idk if it's better than the regular merge methods with similar settings tbh. Comparing it to the other version listed, mine seemed to do worse personally. Perhaps adjusting it slightly might help, but I'd say leave it for now.

Yeah, I briefly tested it locally and am unsure myself. The perplexity is not that different though.

The prompting methods discussed should be good enough for your merge later, so don't sweat it too much. I was looking at potential models to experiment with Yi on my end and I came across this which might interest you in the Vicuna merge.

https://huggingface.co./bhenrym14/platypus-yi-34b

It's not 200K, but it might be useful to slerp with Sus-Hermes since they are both 32k. Therefore it would leave more room to add a second Tess variant for the final merge later to both extend context and not be too samey in structure. What do you think about this for the mix?

Platypus + SUS-Hermes = Plat-Hermes
Capybara + Nontoxic-Bagel = Capy-Bagel
Tess-v1 + Pallas-v4 + Capybara + Plat-Hermes + Capy-Bagel.

Oh yeah, I didn't realize Platypus Yi was Vicuna format. Yeah I think that is a good idea, to "Vicunafy" Sus Hermes.

That looks like an excellent recipe.

What I would also suggest is to merge some Yi 200K base to "200kify" plat-hermes without affecting the prompt format. That actually seems to work in my tests, and it makes sense because Yi 200K seems to be a Yi 4K finetune.

If that method works for you then go for it! I see no logical reason not to, so long as it doesn't affect prompting ofc.

Quick question: Do you get any errors when merging any of the pure Vicuna models on your end with Mergekit? I've been trying to see how using the custom versions of SUS-Chat-RP and UNA-Bagel might mix with Vicuna, but every time I use Tess or Pallas I get this.

(WARNING:root:Using submatrix of /home/oem/Desktop/merge/migtissera_Tess-34B-v1.4:model.embed_tokens.weight)

Meanwhile Capybara just gives me a "killed" message at around 13% of the merging process.

I've tried updating Mergekit and using different merge methods, but none of them seem to work. Is there something I'm missing when doing these Yi models? Cause I've been able to merge the others just fine in the 34b range like Nous-Hermes and the other 2 models. Are there some unique settings you use when doing this stuff? Cause I've been bashing my head for a good week now trying to do this, and I didn't see anyone having these issues on the github either...

(WARNING:root:Using submatrix of /home/oem/Desktop/merge/migtissera_Tess-34B-v1.4:model.embed_tokens.weight)

This is just a warning, but technically you need to use the "union" tokenizer merge to get it to go away and merge the vocabularies of all the models.

You also need to start mergekit with --lazy-unpickle to get it to merge with a reasonable amount of RAM, and --cuda if you are merging 4 or fewer models. If the process just shuts down, that means it's out of memory and you need to add more swap.
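As a small sketch of that rule of thumb, the helper below builds a `mergekit-yaml` command line and only adds `--cuda` when there are 4 or fewer models. The helper name, the config path, and the output directory are hypothetical; the 4-model cutoff is just the figure mentioned above for a 24GB card, not a mergekit limit.

```python
def build_merge_command(config_path, output_dir, num_models):
    """Build a mergekit-yaml argument list following the rule of thumb above."""
    # --lazy-unpickle keeps RAM usage reasonable when loading the checkpoints.
    cmd = ["mergekit-yaml", config_path, output_dir, "--lazy-unpickle"]
    # Only use the GPU for small merges; larger ones go CPU-only with more swap.
    if num_models <= 4:
        cmd.append("--cuda")
    return cmd

# Hypothetical paths, purely for illustration.
small = build_merge_command("vicuna-merge.yml", "./merged-model", num_models=3)
large = build_merge_command("megamerge.yml", "./merged-model", num_models=7)
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`, or joined with spaces and pasted into a shell.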

I haven't run into any issues with the Vicuna models specifically.

Some models are funky though, yeah. For instance, Nous Hermes and Yi Chat have some kind of misconfigured tokenizer/vocab that breaks mergekit and probably leaves the models themselves slightly broken.

Good luck, I'm happy to help if stuff still isn't working.

I guess that solves one problem with the tokenizer. I assume I take out the "Base model" path and just use an empty "Union" folder for it to generate, yeah?

Still, even when using lazy unpickle and cuda it still kills the model at 13%, even if I'm only merging two models. And I've been able to merge 3 34B models before just fine, so I don't think it's a memory issue... The only thing different I can find is Capybara has some extra files inside like tokenization_yi.py which is different from the others, and coincidentally Dolphin has something similar in it as well and I've been having trouble doing it as well (Killed at 13%). I think it's because they were the first proper models Yi made I guess? None of the others have that problem for me.

Also I'm surprised you've been having the exact opposite problem with Hermes and SUS-chat, lol. You want me to merge them on my end and then you can use them for your stuff later? It might iron out your problems.

Oh yeah. I thought Capybara was llamafied, if the Yi tokenizer is in there, that's trouble.

I... may have fixed it locally by just slapping in another tokenizer. It was so long ago I can't remember TBH.

Charles did post code to officially "llamafy" the base weights of old Yi models. I successfully used it on Deepsex, but unfortunately it doesn't do anything to the tokenizer. For that I just copied Yi's new tokenizer in.

https://huggingface.co./chargoddard/Yi-34B-Llama/discussions/7

As for where to put the "union," see the config here: https://huggingface.co./brucethemoose/Yi-34B-200K-DARE-megamerge-v8#configuration

I did a quick test using the methods you recommended. Good news? I'm not having problems with the Tess models being fussy with token weights anymore! Bad news? Still can't get Capybara to go past 13% at all... Tried using Yi's updated one, but still nothing... Even tried jamming Doctor Shotgun's Capybara-RP's entire folder in minus the pytorch/safetensors files, since that seemed to be the closest link I could find, but still stops at 13% even then, and I had no issues with his model to merge with others. Last ditch effort and used the script you provided, and...nothing but more errors. Face desk

I hate to ask this, but would it be too much trouble if you upload your Capybara version to huggingface for others looking to merge like me? I'd really be grateful.

Yeah sure, uploading in a sec.

If it's crashing at 13%, it does make me think it's running out of memory though. That is coincidentally about when memory usage peaks for me.

It will take awhile, but be here when it's done: https://huggingface.co./brucethemoose/Capybara-Fixed-Temp

I'll check it out later, thanks! Still... I don't understand why the original would OOM for me though, considering I'm able to merge 3 models just fine, and that's without applying lazy unpickle...

Back with some feed...sorry, meant to say here's some feedback. Also some findings which may or may not interest you in your merges.

Was able to use the Capybara model without hanging on 13% but it also gave some warnings about using another tokenizer. However, just making a copy of it and then slerping the 2 into chargod's Yi base solved the issue, and now I use that no problem. A separate issue with Nontoxic-Bagel getting killed at around 58% when doing wider merges was also happening, yet somehow being able to finish when there were only 2 models present. Like before, slerping the copies together fixed it and now it merges fine on a wider scale. It doesn't seem to fix everything, since the Tess variants still produced the warning even when slerped together, unless union is used. Maybe the issues you had with Hermes could be fixed in this manner? It's a technique I like to call self-slerp, not to be confused with self-su...actually, never mind.

There's also 2 extra smaller Vicuna models you could potentially use if you want.

NobodyExistsOnTheInternet/Yi-34B-GiftedConvo-merged

Sao10K/NyakuraV2-34B-Yi-Llama

GiftedConvo doesn't have anything too unique about it, since half of its data is already stuff Capybara uses. But the model is at least competent from some brief testing. Nya on the other hand is unique, but much like regular Capybara, it has some minor issues, mainly that it likes to use dollar $igns at the end for some reason, and I've encountered it 3 times thus far. So if you use it you might have to add "$" as an extra custom stopping string. Hope you don't use your models for economical purposes! So if you wanted some more slerp fodder for one of the SUS models, or even Hermes, you have options now.
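If your frontend doesn't have a custom stopping string option, the same effect can be approximated after generation. This is a minimal sketch, not tied to any particular UI or backend; the function name and the example strings are made up for illustration.

```python
def truncate_at_stop(text, stop_strings):
    """Cut generated text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Hypothetical model output ending in a stray dollar sign.
reply = truncate_at_stop("Sure, here you go.$", ["$", "</s>"])
```

The obvious downside is the one joked about above: any reply that legitimately contains a "$" gets cut short too.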

For now I'm tinkering with just a simple merge of 4 models, but do continue with your mega merge and trying to extend the conga-line and seeing where that goes.

Thanks, I will look into these.

And I am sorry about the hanging in the middle of the merge. You can start a GH issue on mergekit or just chat with Charles on the Kobold AI discord to try and figure that out.

It's not that big of a deal honestly, since I was able to fix it, even if the method was kinda scuffed, lol.
