Some AI video output is mind-bendingly bizarre.

    • For example, take some of the examples in this model shootout, where different models are compared on how they handle the same query about cutting a steak.
    • The AI video output looks totally reasonable in a given frame, but as the video plays and the AI has to make sense of potentially ambiguous or weird details, it sometimes resolves them in impossible, unrealistic ways.
      • An example I experienced this week: a video of a Christmas village from a bird's-eye view, with bokeh around the points of Christmas lights down below.
      • So far, so good.
      • Now, the camera dollies forward through the sky, towards the village.
      • The model didn't realize that the big glowing circles were bokeh that should stay anchored to the lights they emanate from.
      • Instead, it interpreted them as giant floating light-emitting orbs hovering over the village.
      • Bizarre!
    • Watching the model make weird decisions about the world in the video is deeply disconcerting.
    • It's as if the model is desperately trying to make sense of the world depicted in the frame, and sometimes resolving it badly.
    • This, by the way, is how human minds work too.
    • Our minds are constantly trying to predict what they'll experience next by building up an implicit model of the world.
    • Sometimes our brain guesses wrong, and then more signal comes in that forces it to snap to a different mental model.
      • Various optical illusions trigger this reliably.
    • When it happens, there's a kind of whooshing vertigo feeling as the whole world reorients around you… but nothing visually changes.
      • Kind of like the dolly zoom camera move that Jaws made famous.
      • A "wait what is even happening" kind of disconcerting effect.
    • We're trying to make sense of an actual physical reality with fixed constraints, so the visual field doesn't change in that moment; only our interpretation of it does.
    • The AI is trying to simulate a coherent reality, so when it makes a bad implicit world model choice, it leads directly to odd, unrealistic visual artifacts.
      • For humans, we have tons of experience with the real world, and also the physical world is primary and our perception of it is secondary.
      • For AI video models, the visual output is primary and the world model is secondary.
      • AI video models also have much less ground truth experience in the real world than humans do.
    • Watching the model make a weird interpretation that goes against your expectations gives you that same disconcerting world-model-swapping feeling.
