Some AI video output is mind-bendingly bizarre.
- For example, some of the clips in this model shootout, which compares how different models handle the same prompt about cutting a steak.
- The AI video output looks totally reasonable in a given frame, but as the video plays and the model has to make sense of potentially ambiguous or weird details, it sometimes resolves them with impossible, unrealistic solutions.
- An example I experienced this week: a video of a Christmas village from a bird's-eye view, with bokeh around points of Christmas lights down below.
- So far, so good.
- Now, the camera dollies forward through the sky, towards the village.
- The model didn't realize that the big glowing circles were bokeh that should stay anchored to the physical lights they emanate from.
- Instead, it interpreted them as giant floating light-emitting orbs over the village.
- Bizarre!
- Watching the model make these weird decisions about the world in the video is deeply disconcerting: it's as if the model is trying desperately to make sense of the world depicted in the frame, and sometimes reaching for the wrong explanation.
- This, by the way, is how human minds work too.
- Our minds are constantly trying to predict what they'll experience next, by building up an implicit model of the world.
- Sometimes our brain guesses wrong, and later more signal comes in that forces it to snap to a different mental model.
- Various optical illusions trigger this reliably.
- When it happens, there's a kind of whooshing vertigo feeling as the whole world reorients around you… but nothing visually changes.
- Kind of like the dolly zoom camera move Jaws made famous.
- A "wait what is even happening" kind of disconcerting effect.
- We're trying to make sense of an actual physical reality with fixed constraints, so the visual field doesn't change in that moment; only our interpretation of it does.
- The AI is trying to simulate a coherent reality, so when it makes a bad implicit world model choice, it leads directly to odd, unrealistic visual artifacts.
- For humans, the physical world is primary and our perception of it is secondary, and we have tons of experience with that world.
- For AI video models, it's the reverse: the visual output is primary and the world model is secondary.
- AI video models also have much less ground truth experience in the real world than humans do.
- Watching the model make a weird interpretation that goes against your expectations gives that same disconcerting world-model-swapping feeling.