Video models can do zero-shot reasoning tasks.

· Bits and Bobs 10/6/25
  • Video models can do zero-shot reasoning tasks.
    • For example: render a maze, with a mouse at the start and cheese at the end.
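    • Then have the model generate the video frames that follow.
    • In the generated frames, the mouse solves the maze and finds the cheese.
    • This is chain-of-frame thinking: the reasoning unfolds frame by frame, much like chain-of-thought does token by token.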
    • These are emergent capabilities of video models that imply a kind of internal world model.
      • The world model is imperfect, but surprisingly strong given that it comes purely from the brute force of training on tons of video.
      • The easiest way to make a reasonable next frame of a video is to implicitly build a world model.
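
To make the maze example concrete, here is a minimal sketch of the task setup and of how a generated rollout could be checked. It is not how a video model works internally: the frames here come from a plain breadth-first search, standing in for whatever a model would generate. The grid encoding (`#` wall, `M` mouse, `C` cheese), the sample maze, and every function name are assumptions made up for illustration. The sketch also assumes the maze is fully enclosed by walls.

```python
from collections import deque

# Hypothetical maze: '#' = wall, '.' = open, 'M' = mouse start, 'C' = cheese.
# Assumed to be fully enclosed by walls, so neighbor lookups stay in bounds.
MAZE = [
    "#######",
    "#M..#.#",
    "#.#.#.#",
    "#.#...#",
    "#.###C#",
    "#######",
]

def find(grid, ch):
    """Return (row, col) of the first occurrence of ch in the grid."""
    for r, row in enumerate(grid):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not found in grid")

def solve(grid):
    """Breadth-first search from mouse to cheese; returns the path or None."""
    start, goal = find(grid, "M"), find(grid, "C")
    prev = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            break
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if grid[nxt[0]][nxt[1]] != "#" and nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    else:
        return None  # queue emptied without reaching the cheese
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def render_frames(grid, path):
    """One text 'frame' per step, with the mouse drawn at its current cell."""
    base = [row.replace("M", ".") for row in grid]
    frames = []
    for r, c in path:
        rows = [list(row) for row in base]
        rows[r][c] = "M"
        frames.append(["".join(row) for row in rows])
    return frames

def is_valid_rollout(grid, frames):
    """Check that a frame sequence is a legal solution of the maze:
    starts at the mouse, moves one open cell per frame, ends on the cheese."""
    if not frames:
        return False
    positions = [find(frame, "M") for frame in frames]
    if positions[0] != find(grid, "M"):
        return False
    for (r0, c0), (r1, c1) in zip(positions, positions[1:]):
        if abs(r0 - r1) + abs(c0 - c1) != 1:
            return False  # teleported or stood still
        if grid[r1][c1] == "#":
            return False  # walked through a wall
    return positions[-1] == find(grid, "C")

path = solve(MAZE)
frames = render_frames(MAZE, path)
```

A checker like `is_valid_rollout` is the interesting half: if the mouse positions could be extracted from a model's generated frames, the same rules (no teleporting, no walking through walls, end on the cheese) would tell you whether the model actually solved the maze or just produced plausible-looking motion.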