Thank you! I'll see about building a test suite.

Do you compare models' output subjectively, manually? Or do you have some objective measures? My use case would be to test diagnostic information summaries - the output is free text, not structured. The only way I can think to automate that would be with another LLM.

Advice welcome!

Yeah - things are easy when you can objectively score an output; otherwise, as you said, you'll probably need another LLM to score it. For summaries you can make that somewhat more objective with measures like length and key-point coverage ("8/10 key points are covered in this summary").
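
For what it's worth, here's a minimal sketch of that key-point approach with an LLM as the judge. It assumes the OpenAI Python client; the model name, prompt wording, and the hand-written key_points list are all illustrative assumptions, not a recommendation:

    # Minimal LLM-as-judge sketch: ask a judge model which expected
    # key points a free-text summary covers, and score the fraction.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def score_summary(summary: str, key_points: list[str]) -> float:
        """Fraction of expected key points the judge says are covered."""
        prompt = (
            "Here is a diagnostic summary:\n\n"
            f"{summary}\n\n"
            "Which of the following key points does it cover? Answer with "
            "only the numbers of the covered points, comma-separated.\n"
            + "\n".join(f"{i + 1}. {p}" for i, p in enumerate(key_points))
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any capable judge model works
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep the judging as repeatable as possible
        )
        answer = resp.choices[0].message.content or ""
        # Parse the judge's answer like "1, 3, 4" into a set of point numbers.
        covered = {int(t) for t in answer.replace(",", " ").split() if t.isdigit()}
        return len(covered & set(range(1, len(key_points) + 1))) / len(key_points)

You'd write the key points per test case by hand once, which keeps the subjective part small and up front instead of in every comparison.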

Scoring outputs with another model is also how some real training methods work (Group Relative Policy Optimization, for example), so it's a legitimate approach.
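
For flavor, the "group relative" part of GRPO boils down to sampling several outputs for the same prompt, scoring each, and normalizing the scores within the group. A toy sketch of just that normalization step (names here are illustrative, not from any library):

    # Score several sampled outputs for one prompt, then normalize
    # within the group: above-average outputs get positive advantage.
    import statistics

    def group_relative_advantages(rewards: list[float]) -> list[float]:
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
        return [(r - mean) / std for r in rewards]

    # e.g. judge scores for 4 sampled summaries of the same input
    print(group_relative_advantages([0.8, 0.5, 0.9, 0.5]))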


Thank you. I will google Group Relative Policy Optimization to learn about that and the other training methods. If you have any resources handy that I should be reading, that would be appreciated as well. Have a great weekend.

Nothing off the top of my head! If you find anything good, let me know. GRPO is a training technique, so it's likely not exactly what you'd do for benchmarking, but it's interesting to read about anyway. Glad I could help.


