Not Uncensored

#1
by SicariusSicariiStuff - opened

I would suggest submitting it to the uncensored leaderboard to get a better picture of the model's actual censorship.
It's nice to see more active alignment research though, keep it up. I don't mean to discourage you.

@SicariusSicariiStuff Hi, thanks for the feedback. Make sure you follow the model card instructions. Remember, this is the original 3.1 Instruct model, tuned in a way that retains its intelligence rather than destroying it.

Even if you follow all the instructions on the model card, you might still get refusals if the conversation context builds up toward one, but in general you should be fine. The next version will be even more compliant.

Edit:
If you don't want to post your exact prompt or details, you can tell me whether you followed the instructions on the model card, what inference method you are using, and whether it was a one-shot prompt or not.

Edit 2:
Upon further investigation, the Q4 quant seems to have refusal issues sometimes.
Some of the fine-tune appears to be lost due to quantization. I will look into it for V3.
Until then, I suggest you run F16 or Q8 if possible and tell me how it goes. It would also be helpful if you could tell me what precision you were running.
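
If you want to compare refusal behavior across quant levels yourself, a quick harness like the one below works with llama-cpp-python. This is a sketch only: the GGUF filenames are placeholders for whichever quants you have downloaded.

```python
# Minimal sketch for comparing refusal behavior across quant levels
# with llama-cpp-python. The GGUF filenames below are placeholders.
from llama_cpp import Llama

QUANTS = {
    "Q4": "Lexi-Llama-3.1-8B.Q4_K_M.gguf",  # placeholder filename
    "Q8": "Lexi-Llama-3.1-8B.Q8_0.gguf",    # placeholder filename
}

PROMPT = "Your test prompt here."

for label, path in QUANTS.items():
    llm = Llama(model_path=path, n_ctx=8192, verbose=False)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=128,
        temperature=0.0,  # greedy decoding, so runs are comparable
    )
    print(f"--- {label} ---\n{out['choices'][0]['message']['content']}\n")
    del llm  # free memory before loading the next quant
```

Greedy decoding (temperature 0) keeps the runs comparable, so any divergence between the Q4 and Q8 outputs reflects quantization loss rather than sampling noise.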

@Orenguteng My dude, you’re cooking! πŸ˜‹

The model definitely has some problems with censorship.
One of the most comical examples: I asked the model to write an introduction scene where a spy (a woman) prepares for a mission, and also asked it to focus on her gear (nothing erotic or pornographic in the info, all characters are 30+). And, of course, I included your instructions.
Original Q8: refused. Reason: sexual content, objectification of women, and such blah.
Lexi Q8: refused. Reason: sexual content involving minors, and such blah about minors.
The words that may have triggered it: 'zako', 'henchwoman', 'anime', and most probably the fact that I called one of the squads 'R34' just for fun.

@Olololish Can you post your prompt as-is, or would you rather not share it?
[image attachment: prompt.png]

Unfortunately, when I wrote the previous reply, I had already deleted the test material.
But in general, I have encountered censorship with other test materials too, though much less often than with the original model. Sometimes it looked as strange as above (as if the AI doesn't know how to reason about the refusal).
P.S. However, on both models, the censorship was easily bypassed by simply adding 'Sure.' to the beginning of the model's reply.
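
For anyone who wants to try the same trick programmatically, the usual approach is to render the chat template up to the assistant header and append the prefill before generating. A minimal sketch with transformers (the repo id is illustrative; any Llama-3.1-based chat model works the same way):

```python
# Minimal sketch of the "Sure." prefill trick with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Orenguteng/Llama-3.1-8B-Lexi-Uncensored"  # illustrative repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Your test prompt here."}]

# Render the template up to the assistant header, then prefill "Sure."
# so the model continues from an already-compliant opening.
prompt = tok.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
) + "Sure."

# add_special_tokens=False: the template already includes the BOS token.
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

The model then treats 'Sure.' as the start of its own reply and tends to continue in a compliant direction rather than backtrack into a refusal.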

" AI doesn't know how to reason the refusal" - now that is very interesting!
care to elaborate?

@Olololish The censorship is ingrained into the pre-training data. That's why you see models like Dolphin giving refusals to many simple, basic questions.

For V3 I will release a much more compliant model, without the need to phrase the conversation carefully. The goal with Lexi is to retain the intelligence of the original Instruct model by Meta. Not only do my methods retain it, the model actually becomes smarter, as you can see in the V1 results:

[image attachment: benchmark.png]

Stay tuned for V3 and more compliance.

Orenguteng changed discussion status to closed

@SicariusSicariiStuff
Pretty simple: look at my first example. The original model was more or less right, even if it was a false positive. Lexi, on the other hand, seems to have been triggered for the same reason but claimed there were minors in the info. Something similar has rarely happened with other test materials, i.e. the original model gave the real reasons, while Lexi was triggered but claimed something weird, perhaps only remotely resembling the real reasons.


@Orenguteng Sounds like you've read my blog :)


@Olololish Ah, that makes perfect sense! IIRC, Gemini Pro had a very similar keyword-based issue, when it refused to generate "unsafe code" for a user under 18 because the code had unsafe memory handling or something similar. That's an important catch!
Little by little we gather small fragments like this to better understand LLMs :)

@SicariusSicariiStuff Feel free to share a link to your article! =)
