This is actively being worked on by pretty much every major provider. It was the subject of that recent OpenAI paper on hallucinations. It's mostly caused by benchmarks that reward correct answers but don't penalize bad answers any more than simply not answering.
E.g., most current benchmarks have a scoring scheme of:
1 - Correct Answer
0 - No answer or incorrect answer
But what they need is something more like:
1 - Correct answer
0.25 - No answer
0 - Incorrect answer
You need benchmarks (particularly those used in training) to incentivize the models to acknowledge when they're uncertain.
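A minimal sketch of that scoring idea in Python, assuming a simple string-match eval; the function names and the abstention check are hypothetical, not taken from any real benchmark harness:

```python
# Hypothetical scoring sketch: reward abstaining more than guessing wrong.

ABSTAIN_MARKERS = {"i don't know", "i'm not sure", "cannot determine"}

def is_abstention(answer: str) -> bool:
    """Crude check for whether the model declined to answer."""
    return answer.strip().lower() in ABSTAIN_MARKERS

def score_answer(answer: str, gold: str) -> float:
    """Score one item: 1.0 correct, 0.25 abstain, 0.0 incorrect."""
    if is_abstention(answer):
        return 0.25
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0

def score_benchmark(answers: list[str], golds: list[str]) -> float:
    """Average score over a benchmark."""
    return sum(score_answer(a, g) for a, g in zip(answers, golds)) / len(golds)
```

Under this scheme a model that abstains on items it would otherwise get wrong scores higher than one that always guesses, whereas with 1/0 scoring, guessing is strictly optimal.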