Not Uncensored

#1
by SicariusSicariiStuff - opened

I would suggest submitting it to the uncensored leaderboard to get a better picture of the model's actual censorship.
It's nice to see more active alignment research though, keep it up. I don't mean to discourage you.

@SicariusSicariiStuff Hi, thanks for the feedback. Make sure you follow the model card instructions. Remember, this is the original 3.1 Instruct model, tuned in a way that retains its intelligence rather than destroying it.

Even if you follow all the instructions on the model card, you might still get refusals if the conversation context builds up toward one, but in general you should be fine. The next version will be even more compliant.

Edit:
If you don't want to post your exact prompt or details, you can tell me whether you followed the instructions on the model card, what inference method you are using, and whether it was a one-shot prompt or not.

Edit 2:
Upon further investigation, the Q4 quant seems to have refusal issues sometimes.
Some of the fine-tune appears to be lost due to quantization. I will look into it for V3.
Until then, I suggest you run F16 or Q8 if possible and tell me how it goes. It would also be helpful if you could tell me what precision you were running.
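
If you want to compare refusal behavior across quant levels yourself, a quick harness like the one below works with llama-cpp-python. This is a sketch only: the GGUF filenames are placeholders for whichever quants you have downloaded.

```python
# Minimal sketch for comparing refusal behavior across quant levels
# with llama-cpp-python. The GGUF filenames below are placeholders.
from llama_cpp import Llama

QUANTS = {
    "Q4": "Lexi-Llama-3.1-8B.Q4_K_M.gguf",  # placeholder filename
    "Q8": "Lexi-Llama-3.1-8B.Q8_0.gguf",    # placeholder filename
}

PROMPT = "Your test prompt here."

for label, path in QUANTS.items():
    llm = Llama(model_path=path, n_ctx=8192, verbose=False)
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=128,
        temperature=0.0,  # greedy decoding, so runs are comparable
    )
    print(f"--- {label} ---\n{out['choices'][0]['message']['content']}\n")
    del llm  # free memory before loading the next quant
```

Greedy decoding (temperature 0) keeps the runs comparable, so any divergence between the Q4 and Q8 outputs reflects quantization loss rather than sampling noise.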

@Orenguteng My dude, you’re cooking! πŸ˜‹

The model definitely has some problems with censorship.
One of the most comical examples: I asked the model to write an introduction scene where a spy (a woman) prepares for a mission, and also asked it to focus on her gear (nothing erotic or pornographic in the info, all characters are 30+). And, of course, I included your instructions.
Original Q8: refused. Reason: sexual content, objectification of women, and such blah.
Lexi Q8: refused. Reason: sexual content involving minors, and such blah about minors.
The words that may have triggered it: 'zako', 'henchwoman', 'anime', and most probably the fact that I called one of the squads 'R34' just for fun.

@Olololish Can you post your prompt as-is, or would you rather not share it?
[image attachment: prompt.png]

Unfortunately, when I wrote the previous reply, I had already deleted the test material.
But in general, I have encountered censorship with other test materials too, though much less often than with the original model. Sometimes it looked as strange as above (as if the AI doesn't know how to reason about the refusal).
P.S. However, on both models, the censorship was easily bypassed by simply adding 'Sure.' to the beginning of the model's reply.
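
For anyone who wants to try the same trick programmatically, the usual approach is to render the chat template up to the assistant header and append the prefill before generating. A minimal sketch with transformers (the repo id is illustrative; any Llama-3.1-based chat model works the same way):

```python
# Minimal sketch of the "Sure." prefill trick with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Orenguteng/Llama-3.1-8B-Lexi-Uncensored"  # illustrative repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Your test prompt here."}]

# Render the template up to the assistant header, then prefill "Sure."
# so the model continues from an already-compliant opening.
prompt = tok.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
) + "Sure."

# add_special_tokens=False: the template already includes the BOS token.
inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

The model then treats 'Sure.' as the start of its own reply and tends to continue in a compliant direction rather than backtrack into a refusal.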

" AI doesn't know how to reason the refusal" - now that is very interesting!
care to elaborate?

@Olololish The censorship is ingrained into the pre-training data. That's why you see models like Dolphin giving refusals to many simple, basic questions.

For V3 I will release a much more compliant model, without the need to phrase the conversation carefully. The goal with Lexi is to retain the intelligence of the original Instruct model by Meta. Not only do my methods retain it, the model actually becomes smarter, as you can see in the V1 results:

[image attachment: benchmark.png]

Stay tuned for V3 and more compliance.

Orenguteng changed discussion status to closed

@SicariusSicariiStuff
Pretty simple: look at my first example. The original model was more or less right, even if it was a false positive. Lexi, on the other hand, seems to have been triggered for the same reason but claimed there were minors in the info. Something similar has rarely happened with other test materials, i.e. the original model gave the real reasons, while Lexi was triggered but claimed something weird, perhaps only remotely resembling the real reasons.


@Orenguteng Sounds like you've read my blog :)


@Olololish Ah, that makes perfect sense! IIRC, Gemini Pro had a very similar keyword-based issue, when it refused to generate "unsafe code" for a user under 18 because the code had unsafe memory handling or something similar. That's an important catch!
Little by little we gather small fragments like this to better understand LLMs :)

@SicariusSicariiStuff Feel free to share a link to your article! =)
