It's very hard to teach LLMs to be good at math.
Take the "Which is bigger: 9.9 or 9.11?" question from a few days ago, the one that led Andrej Karpathy to his "jagged intelligence" tweet.
LLMs are like system 1: a general approach that is surprisingly good at vibes-based processing of everything, but not particularly great at anything.
But there are lots of subproblems that can be handled with resilient correctness: a kind of niche system 2.
For example, plain old computers are extremely good at arithmetic.
Why try to brute-force everything with a system 1 approach?
Why not plug in adjacent systems for some subdomains, like arithmetic, through a function-calling-style interface?
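As a sketch of what that could look like: a minimal, hypothetical tool-calling loop where the model, instead of guessing at arithmetic in its weights, emits a call to a `calculate` tool and the runtime evaluates the expression exactly. The tool name, arguments, and dispatch shape here are assumptions modeled loosely on OpenAI-style function calling, not any specific API.

```python
import ast
import operator

# Operators the hypothetical "calculate" tool will accept.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Gt: operator.gt,
    ast.Lt: operator.lt,
    ast.USub: operator.neg,
}

def calculate(expression: str):
    """Safely evaluate a small arithmetic/comparison expression.

    Walks the parsed AST and only allows numeric constants and the
    whitelisted operators above, so arbitrary code can't run.
    """
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        if (isinstance(node, ast.Compare) and len(node.ops) == 1
                and type(node.ops[0]) in _OPS):
            return _OPS[type(node.ops[0])](ev(node.left), ev(node.comparators[0]))
        raise ValueError(f"unsupported expression: {expression!r}")
    return ev(ast.parse(expression, mode="eval"))

# Dispatch table: the runtime routes the model's tool calls here
# instead of letting the model do the math itself.
TOOLS = {"calculate": calculate}

def handle_tool_call(name: str, arguments: dict):
    return TOOLS[name](**arguments)

# The model would emit something like:
#   {"name": "calculate", "arguments": {"expression": "9.9 > 9.11"}}
print(handle_tool_call("calculate", {"expression": "9.9 > 9.11"}))  # True
```

The LLM stays in its system 1 role (deciding *when* to reach for the tool and how to phrase the call), while the plain old computer handles the part it is resiliently correct at.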