If you peek into how multi-modal models work, they're a ball of cheap, random hacks that turn out to be terrifically, unreasonably effective.

Search engines aren't that different on the inside.

At first it seems random and unprincipled.

In a way, it is: we can't explain why any particular hack works with anything like a grand theory.

But there is a logic and order to it; for every random hack that works, there are dozens that were tried and turned out to not work for some reason.

The hacks that work stick; the other ones are forgotten about.

What results is a seemingly arbitrary collection of hacks that just so happen to work.

There is a selection pressure you can't see, made possible by hill climbing: there is a clear, objective metric to experiment against.

But they're hacks on top of hacks, almost certainly stuck at a local maximum.
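To make the local-maximum point concrete, here is a minimal sketch of greedy hill climbing. It is a toy and not how any real search engine or model is tuned: the `score` function is a made-up metric with a small peak and a taller one, and `hill_climb` is a hypothetical helper that keeps a change only if the metric improves.

```python
def hill_climb(score, start, step=1, max_iters=1000):
    """Greedy hill climbing over integers: keep a change only if it improves the score."""
    current = start
    for _ in range(max_iters):
        # Try the two neighbouring "hacks" and pick whichever scores better.
        best = max((current - step, current + step), key=score)
        if score(best) <= score(current):
            # No neighbour improves the metric: we are at a maximum, possibly only a local one.
            return current
        current = best
    return current


def score(x):
    # Hypothetical metric: a small peak at x=2 (score 4) and a taller peak at x=10 (score 9).
    return max(4 - (x - 2) ** 2, 9 - (x - 10) ** 2)


stuck = hill_climb(score, start=0)   # climbs the nearby small peak and stops there
lucky = hill_climb(score, start=7)   # happens to start on the slope of the taller peak
print(stuck, score(stuck))
print(lucky, score(lucky))
```

Starting from 0, every step that improves the metric leads to the small peak, and from there every single change looks like a regression, so the search never finds the taller one. Where you end up depends on where you started and which hacks you happened to try.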
