Thank you! I'll see about building a test suite.

Do you compare models' output subjectively, manually? Or do you have some objective measures? My use case would be to test diagnostic information summaries - the output is free text, not structured. The only way I can think to automate that would be with another LLM.

Advice welcome!

Yeah - things are easy when you can objectively score an output; otherwise, as you said, you'll probably need another LLM to score it. For summaries you can make that somewhat more objective with measures like length and key-point coverage ("8/10 key points are covered in this summary").
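
For what it's worth, here's a minimal sketch of that key-point approach with an LLM as the judge. It assumes the OpenAI Python client; the model name, prompt wording, and the hand-written key_points list are all illustrative assumptions, not a recommendation:

    # Minimal LLM-as-judge sketch: ask a judge model which expected
    # key points a free-text summary covers, and score the fraction.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def score_summary(summary: str, key_points: list[str]) -> float:
        """Fraction of expected key points the judge says are covered."""
        prompt = (
            "Here is a diagnostic summary:\n\n"
            f"{summary}\n\n"
            "Which of the following key points does it cover? Answer with "
            "only the numbers of the covered points, comma-separated.\n"
            + "\n".join(f"{i + 1}. {p}" for i, p in enumerate(key_points))
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any capable judge model works
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep the judging as repeatable as possible
        )
        answer = resp.choices[0].message.content or ""
        # Parse the judge's answer like "1, 3, 4" into a set of point numbers.
        covered = {int(t) for t in answer.replace(",", " ").split() if t.isdigit()}
        return len(covered & set(range(1, len(key_points) + 1))) / len(key_points)

You'd write the key points per test case by hand once, which keeps the subjective part small and up front instead of in every comparison.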

Scoring outputs with another model is also how some real training methods work (Group Relative Policy Optimization, for example), so it's a legitimate approach.
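
For flavor, the "group relative" part of GRPO boils down to sampling several outputs for the same prompt, scoring each, and normalizing the scores within the group. A toy sketch of just that normalization step (names here are illustrative, not from any library):

    # Score several sampled outputs for one prompt, then normalize
    # within the group: above-average outputs get positive advantage.
    import statistics

    def group_relative_advantages(rewards: list[float]) -> list[float]:
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
        return [(r - mean) / std for r in rewards]

    # e.g. judge scores for 4 sampled summaries of the same input
    print(group_relative_advantages([0.8, 0.5, 0.9, 0.5]))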


Thank you. I will google Group Relative Policy Optimization to learn about that and the other training methods. If you have any resources handy that I should be reading, that would be appreciated as well. Have a great weekend.

Nothing off the top of my head! If you find anything good, let me know. GRPO is a training technique, so it's likely not exactly what you'd do for benchmarking, but it's interesting to read about anyway. Glad I could help.


