During the big GPT-4 news cycle, I think a bunch of folks posted claims that were outrageously good: "language model passes medical exams better than humans", etc.
When I looked into them, in nearly all cases the claims were boosted far beyond the reality. And the reality seemed much more consistent with a fairly banal interpretation: LLMs produce realistic-looking text but have no real ability to distinguish truth from fabrication (which is a step beyond bullshit!).
The one example that still interests me is math problem solving. Can next-token predictors really solve generalized math problems as well as children? https://arxiv.org/abs/2110.14168
LLMs are spitting out responses based on their inputs. It is (or was) shockingly effective, but there is no generalized math processing going on. That's not what LLMs are, and that's not how they work.
And yet, trained on a large corpus of correct math statements, they produce responses that are more often right than wrong (I'm taking this as true; it might not be), which simply raises more questions about the nature of math.
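To make the "no generalized math processing" point concrete, here's a minimal sketch of what next-token prediction on an arithmetic prompt looks like. It assumes the Hugging Face transformers library; the small GPT-2 checkpoint and the prompt are just my illustrative choices, not anything from the paper above:

    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Question: What is 17 + 25? Answer:"
    inputs = tokenizer(prompt, return_tensors="pt")

    # Greedy decoding: at each step the model picks the most probable next
    # token given the text so far. No arithmetic routine runs anywhere in
    # here; whether "42" appears in the continuation depends entirely on
    # patterns absorbed from the training text.
    output_ids = model.generate(
        **inputs,
        max_new_tokens=5,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Whether it prints 42 or something else, the point is the same: the output is just the most likely continuation of the prompt, not the result of a calculation.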