Developer of the model here. We built this model as an LLM precisely to address this problem - to be able to use the textual data that accompanies the image, such as the order history or clinical background (e.g. patient demographics). Images and text are both embedded into the conversation, so the LLM can in principle draw on both when responding.
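If it helps, here's a rough sketch of what I mean by "embedded into the conversation". This is illustrative only - the message structure and the encode_image/encode_text names are stand-ins I made up for this comment, not our actual internals:

    # Illustrative sketch only: image and text turns are flattened into one
    # sequence that the LLM attends over. encode_image/encode_text are
    # hypothetical stand-ins for the model's real encoders.
    def encode_image(path):
        # stand-in: a real system runs a vision encoder over the pixel data
        return [f"<img-patch:{path}>"]

    def encode_text(text):
        # stand-in: a real system tokenizes and embeds the text
        return text.split()

    conversation = [
        {"role": "user", "image": "chest_xray.dcm"},
        {"role": "user", "text": "72M, 3 weeks of productive cough, "
                                 "ex-smoker; prior imaging reported normal."},
    ]

    def to_embedding_sequence(conversation):
        tokens = []
        for turn in conversation:
            if "image" in turn:
                tokens.extend(encode_image(turn["image"]))  # image embeddings
            if "text" in turn:
                tokens.extend(encode_text(turn["text"]))    # text embeddings
        return tokens  # one interleaved sequence conditioning the LLM on both

The point is just that the order history and demographics land in the same sequence as the image, so nothing architecturally stops the model from using them together.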
Of course, there are lots of remaining challenges around integration and actually getting access to these data sources (e.g. the EMR systems) when trying to use this in practice.
My experience working with hospital textual data is that, for the most part, it's either useless or doesn't exist. The radiologist reading the image is expected to phone the specialist who requested the images to be read in order to figure out what to do with them.
Hospital systems are atrocious at providing useful information anyway. They are often full of unnecessary or unimportant fields that the requesting side either doesn't know how to fill in, or fills with generic nonsense just to get the request through the system.
It gets worse with DICOMs: the format itself is a mess. You never know where to look for the useful information. The information is often created accidentally, by some automated process that is completely broken but doesn't produce any visible artifacts for whoever handles the DICOM. E.g. the clock on the machine taking the image might be completely wrong; it doesn't show up anywhere on the image, but then a researcher needs to work out the patient's age... and it's off by a few decades.
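To make that concrete, this is roughly the kind of consistency check you end up writing against DICOM headers (a sketch using pydicom; the file name is made up, the tags are standard DICOM attributes):

    # Minimal sketch: compare the recorded PatientAge tag with the age derived
    # from StudyDate and PatientBirthDate. A wrong scanner clock skews
    # StudyDate with no visible artifact on the image itself.
    from datetime import datetime
    import pydicom

    ds = pydicom.dcmread("study_001.dcm")  # hypothetical file

    study = datetime.strptime(ds.StudyDate, "%Y%m%d")
    birth = datetime.strptime(ds.PatientBirthDate, "%Y%m%d")
    derived_age = (study - birth).days // 365  # rough, fine for a sanity check

    recorded = str(ds.get("PatientAge", ""))   # DICOM age string, e.g. "045Y"
    if recorded.endswith("Y") and int(recorded[:-1]) != derived_age:
        print(f"Age mismatch: tag says {recorded}, dates imply {derived_age}y")

And that's the easy case - in practice half the tags you need are empty, free-text, or copied from the wrong patient.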
Every attempt I've seen so far to run a study in a hospital has resulted in about 50% of the collected information being discarded as completely worthless because of how it was acquired.
Radiologists have general knowledge of the system in which they operate. They can identify cases where information is plausible but bogus. But this is so tied to the context of their work that there's no hope of a practical automated solution any time soon. (And I'm talking about hospitals in well-to-do EU countries.)
NB. It might sound like I'm trying to undermine your work, but what I'm actually trying to say is that the environment in which you want to automate things isn't ready to be automated. It's very similar to self-driving cars: if we had built road infrastructure differently, automating driving would have been a lot easier, but because it's so random and so dependent on local context, it's just too hard to build useful automation.
Thanks for the comments. I'm well aware, as I'm also a practicing radiologist! Some hospitals in Australia, where I work, do a good job of enforcing that radiology orders are sent with the appropriate metadata, but I agree that is not the case around the world. Integration, as always, remains the hardest step.
PS genuinely appreciate the engagement and don’t see it as undermining.
Developer here - it's a good point that most of the models were not specifically trained on diagnostic imaging, with the exception of Llava-Med. We would love to compare against other models trained on diagnostic imaging if anyone can grant us access!
Comparison against human experts is the gold standard, but information on human performance in the FRCR 2B Rapids examination is hard to come by - we've provided a reference (1) which shows comparable (at least numerically) performance by human radiologists.
To your point about people just out of training taking their first test (keeping in mind that FRCR training takes 5 years, while practicing medicine in a real clinical setting) - the reference shows that after candidates pass the FRCR 2B Rapids for the first time, their performance actually declines (at least in the first year), so I'm not sure experts would do better over time.
One of the developers here. The paper links to an earlier model from a different group that could only interpret X-rays of specific body parts. Our model does not have that limitation.
However, the actual FRCR 2B Rapids exam question bank is not publicly available, and the FRCR is unlikely to agree to release it, as this would compromise the integrity of their examination in the future - so the tests used are mock examinations, none of which were provided to the model during training.