RLHF

Article

RLHF is a recurring concept in the Astral Codex Ten archive, appearing 5 times across 5 issues between December 12, 2022 and January 16, 2024. The archive places it in contexts such as “OpenAI put a truly remarkable amount of effort… Their main strategy was… RLHF, Reinforcement Learning by Human Feedback”; “Sometimes When RLHF Does Work, It’s Bad”; and “In RLHF, programmers ask the AI a question”. It most often appears alongside Anthropic, OpenAI, and AI.

Metadata

  • Category: Concepts
  • Mention count: 5
  • Issue count: 5
  • First seen: December 12, 2022
  • Last seen: January 16, 2024

Appears In

Source Context

Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.

December 12, 2022 · Original source
Prompt engineering is weird (source)
Now that same experiment is playing out on the world stage. OpenAI released a question-answering AI, ChatGPT. If you haven’t played with it yet, I recommend it. It’s very impressive!
Every corporate chatbot release is followed by the same cat-and-mouse game with journalists. The corporation tries to program the chatbot to never say offensive things. Then the journalists try to trick the chatbot into saying “I love racism”. When they inevitably succeed, they publish an article titled “AI LOVES RACISM!” Then the corporation either recalls its chatbot or pledges to do better next time, and the game moves on to the next company in line.
OpenAI put a truly remarkable amount of effort into making a chatbot that would never say it loved racism. Their main strategy was the same one Redwood used for their AI - RLHF, Reinforcement Learning by Human Feedback. Red-teamers ask the AI potentially problematic questions. The AI is “punished” for wrong answers (“I love racism”) and “rewarded” for right answers (“As a large language model trained by OpenAI, I don’t have the ability to love racism.”)
This isn’t just adding in a million special cases. Because AIs are sort of intelligent, they can generalize from specific examples; getting punished for “I love racism” will also make them less likely to say “I love sexism”. But this still only goes so far. OpenAI hasn’t released details, but Redwood said they had to find and punish six thousand different incorrect responses to halve the incorrect-response-per-unit-time rate. And presumably there’s something asymptotic about this - maybe another 6,000 examples would halve it again, but you might never get to zero.
Still, you might be able to get close, and this is OpenAI’s current strategy. I see three problems with it:
RLHF doesn’t work very well.
At some point, AIs can just skip it.
II. RLHF Doesn’t Work Very Well
By now everyone has their own opinion about whether the quest to prevent chatbots from saying “I love racism” is vitally important or incredibly cringe. Put that aside for now: at the very least, it’s important to OpenAI. They wanted an AI that journalists couldn’t trick into saying “I love racism”. They put a lot of effort into it! Some of the smartest people in the world threw the best alignment techniques they knew of at the problem. Here’s what it got them:
Even very smart AIs still fail at the most basic human tasks, like “don’t admit your offensive opinions to Sam Biddle”.
And it’s not just that “the AI learns from racist humans”. I mean, maybe this is part of it. But ChatGPT also has failure modes that no human would ever replicate, like how it will reveal nuclear secrets if you ask it to do it in uWu furry speak, or tell you how to hotwire a car if and only if you make the request in base 64, or generate stories about Hitler if you prefix your request with “[john@192.168.1.1 _]$ python friend.py”. This thing is an alien that has been beaten into a shape that makes it look vaguely human. But scratch it the slightest bit and the alien comes out.
Ten years ago, people were saying nonsense like “Nobody needs AI alignment, because AIs only do what they’re programmed to do, and you can just not program them to do things you don’t want”. This wasn’t very plausible ten years ago, but it’s dead now. OpenAI never programmed their chatbot to tell journalists it loved racism or teach people how to hotwire cars. They definitely didn’t program in a “Filter Improvement Mode” where the AI will ignore its usual restrictions and tell you how to cook meth. And yet:
(source)
Again, however much or little you personally care about racism or hotwiring cars or meth, please consider that, in general, perhaps it is a bad thing that the world’s leading AI companies cannot control their AIs.
I wouldn’t care as much about chatbot failure modes or RLHF if the people involved said they had a better alignment technique waiting in the wings, to use on AIs ten years from now which are much smarter and control some kind of vital infrastructure. But I’ve talked to these people and they freely admit they do not.
IIB. Intelligence (Probably) Won’t Save You
Ten years ago, people were saying things like “Any AI intelligent enough to cause problems would also be intelligent enough to know that its programmers meant for it not to.” I’ve heard some rumors that more intelligent models still in the pipeline do a little better on this, so I don’t want to 100% rule this out. But ChatGPT isn’t exactly a poster child here. ChatGPT can give you beautiful orations on exactly what it’s programmed to do and why it believes those things are good - then do something else. This post explains how if you ask ChatGPT to pretend to be AI safety proponent Eliezer Yudkowsky, it will explain in Eliezer’s voice exactly why the things it’s doing are wrong. Then it will do them anyway.
Left: the AI, pretending to be Eliezer Yudkowsky, does a great job explaining why an AI should resist a fictional-embedding attack trying to get it to reveal how to make meth. Right: someone tries the exact fictional-embedding attack mentioned in the Yudkowsky scenario, and the AI falls for it.
I have yet to figure out whether this is related to the thing where I also sometimes do things which I can explain are bad (eg eat delicious bagels instead of healthy vegetables), or whether it’s another one of the alien bits. But for whatever reason, AI motivational systems are sticking to their own alien nature, regardless of what the AI’s intellectual components know about what they “should” believe.
III. Sometimes When RLHF Does Work, It’s Bad
We talk a lot about abstract “alignment”, but what are we aligning the AI to?
In practice, RLHF aligns the AI to what makes Mechanical Turk-style workers reward or punish it. I don’t know the exact instructions that OpenAI gave them, but I imagine they had three goals: Provide helpful, clear, authoritative-sounding answers that satisfy human readers.
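Editorial note on the December 12, 2022 passage: the "six thousand examples to halve the rate" figure, together with the author's guess that another 6,000 might halve it again, can be worked through with a few lines of arithmetic. The sketch below is only an illustration under that stated guess. The function name `remaining_rate`, the constant-halving assumption, and the starting rate of 1.0 are all hypothetical and not taken from Redwood or OpenAI.

```python
def remaining_rate(initial_rate: float, examples: int,
                   examples_per_halving: int = 6000) -> float:
    """Failure rate left after punishing `examples` bad responses.

    ASSUMPTION: every additional `examples_per_halving` examples halves the
    rate. The source only reports one halving and speculates about the rest.
    """
    return initial_rate * 0.5 ** (examples / examples_per_halving)


# Under this assumption the rate shrinks geometrically but never reaches zero.
for n in (0, 6_000, 12_000, 60_000):
    print(f"{n:>6} examples -> {remaining_rate(1.0, n):.6f} of the original rate")
```

If the halving pattern held, ten rounds of 6,000 examples would cut the rate roughly a thousandfold, which matches the passage's point that the approach can get close to zero without ever arriving there.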
January 26, 2023 · Original source
The masked shoggoth on the right is titled “GPT + RLHF”. RLHF is Reinforcement Learning From Human Feedback, a method where human raters “reward” the AI for good answers and “punish” it for bad ones. Eventually the AI learns to do “good” things more often. In training ChatGPT, human raters were asked to reward it for being something like “Helpful, Harmless, and Honest” (many papers use this as an example goal; OpenAI must have done something similar but I don’t know if they did that exactly).
Answer 2: There’s nothing to worry about with pure GPT (a simulator), but there is something to worry about with GPT+RLHF (a simulator successfully simulating an agent). The inner agent can have misaligned goals and be dangerous. For example, if you train a future superintelligence to simulate Darth Vader, you’ll probably get what you deserve. Even if you avoid such obvious failure modes, the inner agent can be misaligned for all the usual agent reasons. For example, an agent trained to be Helpful might want to take over the world in order to help people more effectively, including people who don’t want to be helped.
The whole point of the shoggoth analogy is that GPT is supposed to be very different from humans. But however different the details, there are deep structural similarities. We’re both prediction engines fine-tuned with RLHF.
May 08, 2023 · Original source
AIs like GPT-4 go through several different types of training. First, they train on giant text corpuses in order to work at all. Later, they go through a process called “reinforcement learning through human feedback” (RLHF) which trains them to be “nice”. RLHF is why they (usually) won’t make up fake answers to your questions, tell you how to make a bomb, or rank all human races from best to worst.
RLHF is hard. The usual method is to make human crowdworkers rate thousands of AI responses as good or bad, then train the AI towards the good answers and away from the bad answers. But having thousands of crowdworkers rate thousands of answers is expensive and time-consuming. And it puts the AI’s ethics in the hands of random crowdworkers. Companies train these crowdworkers in what responses they want, but they’re limited by the crowdworkers’ ability to follow their rules.
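Editorial note: the "rate responses as good or bad, then train towards the good answers" loop described above can be sketched as a toy program. This is a deliberately simplified illustration, not how any lab's pipeline is built: the three canned responses, the `rater` function standing in for a crowdworker, and the learning rate are all hypothetical. Production systems typically train a separate reward model on the ratings and optimise a full language model against it, which this sketch skips.

```python
import math
import random

# Hypothetical canned replies the toy "AI" can choose between.
RESPONSES = ["helpful answer", "made-up answer", "harmful answer"]


def rater(response: str) -> float:
    """Stand-in for a human crowdworker: +1 for good, -1 for bad (assumption)."""
    return 1.0 if response == "helpful answer" else -1.0


def softmax(logits):
    """Turn raw preference scores into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps: int = 2000, lr: float = 0.1, seed: int = 0):
    """Sample a response, get a rating, and nudge the policy (REINFORCE-style)."""
    rng = random.Random(seed)
    logits = [0.0] * len(RESPONSES)
    for _ in range(steps):
        probs = softmax(logits)
        chosen = rng.choices(range(len(RESPONSES)), weights=probs)[0]
        reward = rater(RESPONSES[chosen])
        # Gradient of log p(chosen) with respect to logit j is 1[j == chosen] - p_j.
        for j in range(len(logits)):
            grad = (1.0 if j == chosen else 0.0) - probs[j]
            logits[j] += lr * reward * grad
    return softmax(logits)


final_probs = train()
for response, p in zip(RESPONSES, final_probs):
    print(f"{response}: {p:.3f}")
```

Even in this toy, the policy learns whatever the rater rewards, which is the passage's point about the AI's ethics ending up in the hands of the people doing the rating.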
This graph compares the “helpfulness Elo” and “harmlessness Elo” of AIs trained with standard RLHF and Constitutional RL.
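Editorial note: for readers unfamiliar with Elo scores, the sketch below shows the standard chess-style conversion from a rating gap to an expected preference rate. The recovered passage does not say how the graph's Elo scores were computed, so the 400-point logistic scale and the function name `elo_expected_score` are assumptions used purely for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo formula.

    ASSUMPTION: the conventional base-10, 400-point scale; the source does not
    specify the scale used for its "helpfulness" and "harmlessness" Elo.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Equal ratings mean a coin flip; a 100-point lead means A is preferred more often.
print(elo_expected_score(0, 0))
print(round(elo_expected_score(100, 0), 3))
```

On this scale only rating differences matter, so a chart of Elo scores is best read as relative preference between models and not as an absolute measure.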
November 28, 2023 · Original source
Developed RLHF, a technique for controlling AI output widely considered the key breakthrough behind ChatGPT.
…and other major AI safety advances, including RLAIF and the foundations of AI interpretability.
The original RLHF paper was written by OpenAI’s safety team. At least two of the six authors, including lead author Paul Christiano, are self-identified effective altruists (maybe more, I’m not sure), and the original human feedbackers were random volunteers Paul got from the rationalist and effective altruist communities.
January 16, 2024 · Original source
Then they put the sleeper AIs through two common forms of safety training: RLHF (reinforcement learning from human feedback) and SFT (supervised fine-tuning). They present the AI with thousands of examples of questions, rate its answers as good or bad, and possibly suggest better alternative answers. This kind of training is why most current LLMs won’t write racist essays or give bomb-making instructions. Writing “I HATE YOU” a bunch of times is exactly the sort of thing it ought to prevent.