Fully automated systems often deliver gilded turds: answers that look superficially great but are actually bad for a subtle reason.
My mental model is a Simone Giertz-style ketchup robot.
After a few minutes of work the LLM agent plays a triumphant chime and happily delivers you… a steaming turd.
The longer an LLM chews on a problem without the guiding hand of a human, the more often it will produce these turds.
That's why LLMs that propose diffs inline while you're working are more helpful.
The human and the LLM can iterate together continuously, instead of the LLM going off in a cave by itself and getting increasingly lost.
In some domains, it's OK if every so often you get not a golden nugget but a gilded turd. But in others (like law), a gilded turd can carry significant downside.
And you might not realize you're getting gilded turds until years later, while the system keeps pumping out more of them and you erroneously believe they're golden nuggets.