
So it's been claimed, but has it been proven yet?

I'm not even sure what is being alleged here: o1's reasoning tokens are kept secret precisely to prevent this kind of distillation. How can you distill a reasoning process given only the final output?



You really can distill a model into a smaller one just by training on the first model's behavior.

The first model has a much harder problem. It has to iron out the differences between all the different "models" it is trained on (i.e. multiple humans, or noise in the dataset) to arrive at a single consistent view of the relationships in the data. It then has to be tuned to fit its purpose, rather than simply behaving like a composite of random humans.

The second model then has these advantages:

1. It is trained on a single model whose behavior comes from consistent learned relationships, as opposed to many inconsistent sources with noise (lots of different humans operating differently at any given time for reasons beyond what the data can show).

2. It is trained on the post-tuned model's behavior directly. One-step training is a considerably simpler problem than two-step training: it doesn't learn anything it will later need to unlearn or modify.

3. These simplifications both reduce the necessary size of the second model.

And with these combined advantages (a simpler, more consistent single system to model, a smaller size, and fewer training steps), it can be trained much faster.

There can, of course, be disadvantages. The distilled model may not be as versatile if the data used to prompt the original model isn't as diverse as the first model's training data. So prompt data for training still matters.
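To make the mechanics concrete, here is a minimal sketch of that kind of behavior-level distillation, assuming a Hugging Face-style causal LM and a hypothetical file of teacher prompt/response pairs. The model name and "teacher_outputs.jsonl" are placeholders, not anything from this thread:

  # Minimal sketch of behavior-level ("sequence-level") distillation:
  # the student is trained with ordinary cross-entropy on text the
  # teacher model produced, never on the teacher's hidden reasoning.
  import json
  import torch
  from torch.utils.data import DataLoader
  from transformers import AutoModelForCausalLM, AutoTokenizer

  student_name = "some-small-base-model"  # placeholder model ID
  tokenizer = AutoTokenizer.from_pretrained(student_name)
  if tokenizer.pad_token is None:
      tokenizer.pad_token = tokenizer.eos_token
  student = AutoModelForCausalLM.from_pretrained(student_name)
  optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

  # Each record: {"prompt": ..., "response": ...}, where "response" is
  # whatever the teacher API returned (final answers only).
  records = [json.loads(line) for line in open("teacher_outputs.jsonl")]

  def collate(batch):
      texts = [r["prompt"] + r["response"] for r in batch]
      enc = tokenizer(texts, padding=True, truncation=True,
                      max_length=1024, return_tensors="pt")
      # Standard causal-LM objective: predict the next token of the
      # teacher's text; the labels are just the input ids themselves.
      enc["labels"] = enc["input_ids"].clone()
      return enc

  loader = DataLoader(records, batch_size=4, shuffle=True, collate_fn=collate)

  student.train()
  for batch in loader:
      loss = student(**batch).loss
      loss.backward()
      optimizer.step()
      optimizer.zero_grad()

Note that nothing in this loop sees the teacher's hidden reasoning tokens, only its visible outputs; whether that is enough to recover the reasoning quality is exactly what's being questioned above.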


A bit late to reply, but none of this actually answers my question. I don't doubt that distillation from behavior is possible; I doubt that it's possible when 90% of o1's behavior is never returned from the API. If the chain-of-thought process is what improves the results, then distillation without the chain of thought to train on shouldn't produce comparable results.


DeepSeek sometimes outputting that it is ChatGPT is a big clue.



