@m-ric on Hugging Face: "🚨 𝗛𝘂𝗺𝗮𝗻 𝗙𝗲𝗲𝗱𝗯𝗮𝗰𝗸 𝗳𝗼𝗿 𝗔𝗜 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴: 𝗡𝗼𝘁 𝘁𝗵𝗲…"

🚨 𝗛𝘂𝗺𝗮𝗻 𝗙𝗲𝗲𝗱𝗯𝗮𝗰𝗸 𝗳𝗼𝗿 𝗔𝗜 𝘁𝗿𝗮𝗶𝗻𝗶𝗻𝗴: 𝗡𝗼𝘁 𝘁𝗵𝗲 𝗴𝗼𝗹𝗱𝗲𝗻 𝗴𝗼𝗼𝘀𝗲 𝘄𝗲 𝘁𝗵𝗼𝘂𝗴𝗵𝘁?

I’ve just read a great paper in which Cohere researchers raise significant questions about using human feedback to evaluate AI language models.

Human feedback is often regarded as the gold standard for judging AI performance, but it turns out it might be more like fool's gold: the study reveals that human judgments are easily swayed by factors that have nothing to do with actual AI performance.

𝗞𝗲𝘆 𝗶𝗻𝘀𝗶𝗴𝗵𝘁𝘀:
🧠 Several models tested: Llama-2, Falcon-40B, and Cohere Command 6B and 52B.

🙅‍♂️ Refusing to answer tanks AI ratings more than getting facts wrong. We apparently prefer a wrong answer to no answer!

💪 Confidence is key (even when it shouldn't be): More assertive AI responses are seen as more factual, even when they're not. Training methods like RLHF that optimize for these ratings could be pushing AI development in the wrong direction.

🎭 The assertiveness trap: As AI responses get more confident-sounding, non-expert annotators become less likely to notice when they're wrong or inconsistent.

And a consequence of the above:
🔄 𝗥𝗟𝗛𝗙 𝗺𝗶𝗴𝗵𝘁 𝗯𝗮𝗰𝗸𝗳𝗶𝗿𝗲: Using human feedback to train AI (Reinforcement Learning from Human Feedback) could accidentally make AI more overconfident and less accurate.

This paper means we need to think carefully about how we evaluate and train AI systems, to ensure we're rewarding correctness rather than mere appearances of it, like confident talk.

⛔️ Chatbot Arena’s Elo leaderboard, based on crowdsourced votes from average joes like you and me, might become completely irrelevant as models become smarter and smarter.

Read the paper 👉 Human Feedback is not Gold Standard (2309.16349)
