Video models can do zero-shot reasoning tasks.

2025-10-06 · Bits and Bobs 10/6/25

Video models can do zero-shot reasoning tasks.
- Here's Simon's excellent write-up.
- For example: render a maze, with a mouse at the start and cheese at the end.
- Then have the model generate the video frames that follow.
- The mouse solves the maze to find the cheese.
- Chain-of-frame thinking.
- These are emergent capabilities of video models that imply a kind of internal world model.
  - The world model is imperfect, but surprisingly strong given that it comes from nothing more than the brute force of training on tons of video.
  - The easiest way to make a reasonable next frame of a video is to implicitly build a world model.
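To make the maze example concrete, here is a minimal sketch of how such a test could be set up and scored. It is an illustration under stated assumptions, not anything from the write-up: the grid, the names `MAZE`, `solve`, and `is_valid_rollout` are all hypothetical, and it assumes the mouse's position has already been extracted from each generated frame by some other means.

```python
from collections import deque

# Hypothetical toy maze: '#' wall, '.' open, 'M' mouse (start), 'C' cheese (goal).
MAZE = [
    "#######",
    "#M..#.#",
    "#.#.#.#",
    "#.#...#",
    "#.###C#",
    "#######",
]


def find(maze, ch):
    """Return the (row, col) of the first cell containing ch."""
    for r, row in enumerate(maze):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not found in maze")


def solve(maze):
    """Breadth-first search for a shortest path from mouse to cheese.

    Returns a list of (row, col) cells, or [] if the cheese is unreachable.
    Assumes the maze is fully enclosed by walls.
    """
    start, goal = find(maze, "M"), find(maze, "C")
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if maze[nxt[0]][nxt[1]] != "#" and nxt not in prev:
                prev[nxt] = cell
                queue.append(nxt)
    if goal not in prev:
        return []
    path, cell = [], goal
    while cell is not None:
        path.append(cell)
        cell = prev[cell]
    return path[::-1]


def is_valid_rollout(maze, positions):
    """Check a per-frame list of mouse positions (assumed already extracted).

    Valid means: starts at the mouse, ends at the cheese, never enters a
    wall, and moves at most one cell (no diagonals) between frames.
    """
    if not positions:
        return False
    if positions[0] != find(maze, "M") or positions[-1] != find(maze, "C"):
        return False
    for (r, c) in positions:
        if maze[r][c] == "#":
            return False
    for (r0, c0), (r1, c1) in zip(positions, positions[1:]):
        if abs(r0 - r1) + abs(c0 - c1) > 1:
            return False
    return True


reference = solve(MAZE)
```

The reference path from `solve` gives the shortest route, while `is_valid_rollout` only asks whether a generated sequence obeys the maze's rules, which is the more direct test of whether the frames reflect a consistent world model.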