It feels kind of crazy to me that AlphaFold works.
But maybe the reason AlphaFold works isn't so different from the reason transformers are good at images.
The easiest way to predict which way an image is oriented is by developing a world model that picks up on subtle cues that humans would have a hard time even describing.
The easiest way to predict how a protein will fold is by developing a world model that picks up on subtle cues that humans would have a hard time even describing.
It's hard for our brains to handle more than two dimensions, and proteins fold in three.
Apparently DeepMind decided to tackle protein folding when they heard there was a game (Foldit) where humans could play at predicting how proteins fold.
For pattern recognition, if humans can do it, transformers can do it.