Developer here - it's a good point that most of the models were not specifically trained on diagnostic imaging, with the exception of Llava-Med. We would love to compare against other models trained on diagnostic imaging if anyone can grant us access!

Comparison against human experts is the gold standard, but information on human performance in the FRCR 2B Rapids examination is hard to come by - we've provided a reference (1) which shows comparable (at least numerically) performance of human radiologists.

To your point about people just out of training taking their first test (keeping in mind that training for the FRCR takes 5 years, all while practicing medicine in a real clinical setting) - the reference shows that after passing the FRCR 2B Rapids the first time, their performance actually declines (at least in the first year), so I'm not sure experts would do better over time.

1. https://www.bmj.com/content/bmj/379/bmj-2022-072826.full.pdf