- RLHF: Turns a pre-trained model (which just autocompletes text) into a model that you can speak with, i.e. answer user questions and refuse providing harmful answers.
- Distillation: Transfers skills / knowledge / behavior from one model to a smaller model (possibly with a different architecture) by training the second model on the output log-probs of the first model.
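For the distillation case, a minimal sketch of what that objective typically looks like (PyTorch; the temperature `T` and the exact loss weighting vary by setup):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Soften both distributions with temperature T, then push the student's
    # predicted distribution toward the teacher's via KL divergence.
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # T^2 rescaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T
```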
Your description of distillation is largely correct, but not RLHF.
The process of taking a base model that is capable of continuing ('autocompleting') some text input and teaching it to respond to questions in a Q&A chatbot-style format is called instruction tuning. It's pretty much always done via supervised fine-tuning. Otherwise known as: showing it a bunch of examples of chat transcripts.
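In code it's roughly this (a sketch assuming a Hugging Face-style causal LM and tokenizer; the chat markers are made up, real templates differ):

```python
import torch.nn.functional as F

# A toy chat-formatted transcript; real templates (ChatML, Llama, ...) differ.
transcript = (
    "<|user|>\nWhat's the capital of France?\n"
    "<|assistant|>\nParis.<|end|>\n"
)

def sft_step(model, tokenizer, transcript):
    # Instruction tuning is still plain next-token prediction, just on
    # chat-formatted data: predict token t+1 from tokens 0..t.
    # (In practice you'd usually mask the loss on the user turn.)
    ids = tokenizer(transcript, return_tensors="pt").input_ids
    logits = model(ids).logits
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        ids[:, 1:].reshape(-1),
    )
    loss.backward()
    return loss
```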
RLHF is more granular and generally one of the last steps in a training pipeline. With RLHF you train a new model, the reward model.
You make that model by having the LLM output a bunch of responses, and then having humans rank those outputs. E.g.:
Q: What's the capital of France? A: Paris
Might be scored as `1` by a human, while:
Q: What's the capital of France? A: Fuck if I know
Would be scored as `0`.
You train the reward model on those rankings. Then you have the LLM generate a ton of responses and have the reward model score them.
If the reward model says the output is good, the LLM's output is reinforced, i.e. it's told 'that was good, more like that'.
If the output scores low, you do the opposite.
Because the reward model is trained on human preferences, and it is used to reinforce the LLM's outputs based on those preferences, the whole process is called reinforcement learning from human feedback.
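A sketch of how those rankings typically become a training signal for the reward model (pairwise Bradley-Terry-style loss; variable names are mine):

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    # chosen_scores / rejected_scores: scalar scores the reward model assigns
    # to the human-preferred and human-rejected answers for the same prompt.
    # Minimizing this pushes "Paris" above "Fuck if I know".
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```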
> answer user questions and refuse providing harmful answers.
I wonder how this thing can get so much hype. Here's NewGCC, a binary-only compiler that refuses to compile applications it doesn't like... What happened to all the hackers who helped create the open-source movement? Where are they now?
No, this isn't quite right. LLMs are trained in stages:
1. Pre-training. In this stage, the model is trained on a gigantic corpus of web documents, books, papers, etc., and the objective is to predict the next token of each training sample correctly.
2. Supervised fine-tuning. In this stage, the model is shown examples of chat transcripts that are formatted with a chat template. The examples show a user asking a question and an assistant providing an answer. The training objective is the same as in #1: to predict the next token in the training example correctly.
3. Reinforcement learning. Prior to R1, this has mainly taken the form of training a reward model on top of the LLM to steer it toward whole sequences preferred by human feedback (although AI feedback is often used as a similar reward instead). There are different ways to do this. When OpenAI first published the technique (probably their last bit of interesting open research?), they were using PPO. There are now a variety of approaches, including methods like Direct Preference Optimization that don't use a separate reward model at all and are easier to do.
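For reference, the DPO objective boils down to something like this (a sketch; the inputs are summed log-probs of whole responses under the policy being trained and a frozen reference model):

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp, pi_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # The "implicit reward" is the log-ratio between policy and reference,
    # scaled by beta. No separately trained reward model is needed.
    chosen = beta * (pi_chosen_logp - ref_chosen_logp)
    rejected = beta * (pi_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen - rejected).mean()
```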
Stage 1 teaches the model to understand language and imparts world knowledge. Stage 2 teaches the model to act like an assistant. This is where the "magic" is. Stage 3 makes the model do a better job of being an assistant. The traditional analogy is that Stage 1 is the cake; Stage 2 is the frosting; and Stage 3 is the cherry on top.
R1-Zero departs from this "recipe" in that the reasoning magic comes from the reinforcement learning (stage 3). What DeepSeek showed is that, given a reward to produce a correct response, the model will learn to output chain-of-thought material on its own. It will, essentially, develop a chain-of-thought language that helps it accomplish the end goal. This is the most interesting part of the paper, IMO, and it's a result that's already been replicated on smaller base models.
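The reward in that setup can be as simple as a verifiable correctness check, something like the sketch below (the tag and matching rules are illustrative, not DeepSeek's exact implementation):

```python
def correctness_reward(response: str, ground_truth: str) -> float:
    # Strip any chain-of-thought (here assumed to end with a </think> tag)
    # and reward only a verifiably correct final answer -- no human labels,
    # no learned reward model.
    final_answer = response.split("</think>")[-1].strip()
    return 1.0 if ground_truth in final_answer else 0.0
```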
And yet, RLHF is a net helpful step of building an LLM Assistant. I think there's a few subtle reasons but my favorite one to point to is that through it, the LLM Assistant benefits from the generator-discriminator gap. That is, for many problem types, it is a significantly easier task for a human labeler to select the best of few candidate answers, instead of writing the ideal answer from scratch.
[…]
No production-grade actual RL on an LLM has so far been convincingly achieved and demonstrated in an open domain, at scale.
RL on any production system is very tricky, so it seems difficult to make it work in any open domain, not just for LLMs. My suspicion is that RL training is a coalgebra to almost every other form of ML and statistical training, and we don't have a good mathematical understanding of how it behaves.