If you peek into how multi-modal models work, they're a ball of cheap, random hacks that turn out to be terrifically, unreasonably effective.
Search engines aren't that different on the inside.
At first it seems random and unprincipled.
In a way, it is: there's no grand theory that explains why any particular hack works.
But there is a logic and order to it; for every random hack that works, there are dozens that were tried and turned out to not work for some reason.
The hacks that work stick; the other ones are forgotten about.
What results is a seemingly arbitrary collection of hacks that just so happen to work.
There is a selection pressure you can't see, made possible by hill climbing against a clear, objective metric you can experiment with.
But they're hacks on top of hacks, almost certainly stuck at a local maximum.
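
To make the local-maximum point concrete, here is a toy sketch of greedy hill climbing in Python. Everything in it is made up for illustration: the `LANDSCAPE` values and the `hill_climb`, `score`, and `neighbors` names are hypothetical and don't come from any real model or search engine. Each position stands in for a pile of hacks, and the number is how well that pile does on the metric.

```python
def hill_climb(score, start, neighbors, max_steps=1000):
    """Greedy hill climbing: keep moving to the best neighbor while it improves the score."""
    current = start
    for _ in range(max_steps):
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):
            break  # no neighbor is better: we are on a peak (maybe only a local one)
        current = best
    return current


# Hypothetical metric landscape: a small peak at index 2, a taller one at index 8.
LANDSCAPE = [0, 1, 3, 1, 0, 2, 4, 6, 9, 5]


def score(x):
    return LANDSCAPE[x]


def neighbors(x):
    # The only "experiments" available are one step left or one step right.
    return [n for n in (x - 1, x + 1) if 0 <= n < len(LANDSCAPE)]


stuck = hill_climb(score, 0, neighbors)    # starts on the left, stops on the small peak
lucky = hill_climb(score, 5, neighbors)    # starts on the right, reaches the tall peak

print(stuck, score(stuck))
print(lucky, score(lucky))
```

Starting from the left, every single step that is tried either helps or gets thrown away, and the climb still ends on the smaller peak. Getting to the taller one would mean accepting a few steps that make the metric worse first, which is exactly what keep-what-works selection never does.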