Video models can do zero-shot reasoning tasks.
- Here's Simon's excellent write-up.
- For example: render a maze, with a mouse at the start and cheese at the end.
- Then generate video frames.
- The mouse solves the maze to find the cheese.
- Chain-of-frame thinking.
- These are emergent capabilities of video models that imply a kind of internal world model.
- The world model is imperfect, but surprisingly strong given that it comes from nothing more than the brute force of feeding it tons of video.
- The easiest way to make a reasonable next frame of a video is to implicitly build a world model.
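
To make the maze example concrete, here is a minimal sketch of how one might check whether a generated sequence of frames actually solves the maze. It is a toy under loud assumptions: frames are tiny text grids rather than real video, and the symbols `#`, `M`, `C` and the helper names `find` and `solves_maze` are hypothetical, invented for illustration. It does not describe how any video model works internally; it only checks the output, frame by frame.

```python
# Hypothetical toy checker: frames are small text grids, not real video.
WALL, MOUSE, CHEESE = "#", "M", "C"


def find(frame, symbol):
    """Return (row, col) of the first occurrence of symbol, or None."""
    for r, row in enumerate(frame):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    return None


def solves_maze(frames):
    """True if the mouse walks legally from its start to the cheese.

    Assumes each frame is a list of equal-length strings and that the
    first frame shows the walls, the mouse, and the cheese.
    """
    if not frames:
        return False
    first = frames[0]
    walls = {
        (r, c)
        for r, row in enumerate(first)
        for c, ch in enumerate(row)
        if ch == WALL
    }
    cheese = find(first, CHEESE)
    pos = find(first, MOUSE)
    if cheese is None or pos is None:
        return False
    for frame in frames[1:]:
        nxt = find(frame, MOUSE)
        # The mouse must stay visible and never stand on a wall cell.
        if nxt is None or nxt in walls:
            return False
        # The mouse may stay put or move one cell up/down/left/right.
        if abs(nxt[0] - pos[0]) + abs(nxt[1] - pos[1]) > 1:
            return False
        pos = nxt
    return pos == cheese


# A three-frame "video" of a one-corridor maze.
frames = [
    ["#####",
     "#M C#",
     "#####"],
    ["#####",
     "# MC#",
     "#####"],
    ["#####",
     "#  M#",
     "#####"],
]
print(solves_maze(frames))  # True
```

The point of the sketch is that each frame only has to be a plausible continuation of the previous one, yet the sequence as a whole amounts to a solution. A sequence where the mouse jumps straight to the cheese, or passes through a wall, fails the check.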