AlphaZero showed how much learning could happen if you had a rigorous ground truth system.

· Bits and Bobs 1/27/25
  • AlphaZero showed how much learning could happen if you had a rigorous ground truth system.
    • Games like Go have rigid, clear rules about what moves are legal and what constitutes a win.
      • The ground truth can be applied without a human in the loop, because the rules are black and white and easy to model in a computer with full fidelity.
    • That means if you set up a co-evolutionary loop, you can pour extraordinary amounts of compute into it and it will get better and better, with no humans in the loop (see the toy sketch after this list).
      • A self-catalyzing infinite stream of training data.
    • That hasn't worked for things like reasoning yet because there's no ground truth you can efficiently compare against.
    • But GPT-4-class models are now a commodity, and there are a number of open-weights versions.
    • Those models can act as the "ground truth" for other models to bootstrap off of (see the second sketch below)… it just requires conveniently ignoring the license of the open-weights models.
    • Some of the recent breakthroughs likely happened in this way.
      • "We just released an MIT-licensed Llama-derived model."
      • "Wait, what?"[adq]
    • It's impossible to imagine this being stopped; the technique is too powerful, and it's too easy for someone to have a licensing "oopsie".
      • Once the weights are published, there's no taking them back.
      • By the time the original publisher of the infringing model is taken down (which might take a long time, especially if they're international), that derivative model has been picked up by the swarm.
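
Here's the toy sketch referenced above: a minimal, illustrative self-play loop in Python. The game (Nim), the tabular policy, and the multiplicative reinforcement rule are all assumptions made up for this example, not anything from a real AlphaZero codebase; the point is just that when the referee is pure code, every episode is free labeled training data.

```python
# Toy self-play loop: a programmatic referee (the rules of Nim) acts as
# ground truth, so the loop can mint unlimited training data with no
# human in the loop. Illustrative only, not a real AlphaZero setup.
import random
from collections import defaultdict

PILE, MAX_TAKE = 21, 3  # Nim: take 1-3 stones; taking the last stone wins.

# policy[stones][take] = preference weight; starts uniform.
policy = defaultdict(lambda: {t: 1.0 for t in range(1, MAX_TAKE + 1)})

def legal_moves(stones):
    """Ground truth: the rules are black and white and fully modelable."""
    return [t for t in range(1, MAX_TAKE + 1) if t <= stones]

def choose(stones):
    moves = legal_moves(stones)
    weights = [policy[stones][t] for t in moves]
    return random.choices(moves, weights=weights)[0]

def self_play_episode():
    """Play one game against ourselves; return each player's moves and the winner."""
    stones, player, history = PILE, 0, ([], [])
    while True:
        take = choose(stones)
        history[player].append((stones, take))
        stones -= take
        if stones == 0:
            return history, player  # this player took the last stone and wins
        player = 1 - player

for _ in range(50_000):  # pour compute in; every episode is free labeled data
    history, winner = self_play_episode()
    for stones, take in history[winner]:      # reinforce winning moves
        policy[stones][take] *= 1.01
    for stones, take in history[1 - winner]:  # dampen losing moves
        policy[stones][take] *= 0.99

# With enough episodes this tends toward the game-theoretic optimum:
# leave a multiple of 4 stones, so from 21 the best move is to take 1.
best = max(policy[PILE], key=policy[PILE].get)
print(f"learned opening move from {PILE}: take {best} (optimal: 1)")
```

Note that the loop never consults a human: the rules function is the ground truth, and the only limit on data generation is compute.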
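And here is a hedged sketch of the bootstrapping move: treat a strong commodity model as the labeler and write its answers out as supervised data for a smaller student. The endpoint URL, model name, and prompts are hypothetical placeholders; the request shape follows the widely implemented OpenAI-compatible chat API, but check your own server's docs.

```python
# Sketch of teacher-as-ground-truth bootstrapping: sample answers from a
# strong commodity model and save them as supervised training data for a
# smaller student. URL, model name, and prompts are placeholders.
import json
import requests

TEACHER_URL = "http://localhost:8000/v1/chat/completions"  # assumed local server
TEACHER_MODEL = "some-open-weights-model"                  # placeholder name

def ask_teacher(prompt: str) -> str:
    """Treat the stronger model's answer as a training label."""
    resp = requests.post(TEACHER_URL, json={
        "model": TEACHER_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

prompts = ["Explain why the sky is blue.", "Sum the integers 1 to 100."]

with open("distillation_data.jsonl", "w") as f:
    for prompt in prompts:
        answer = ask_teacher(prompt)
        # Each line becomes one supervised example for the student model.
        f.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")
```

Once a file like this exists, fine-tuning a student on it quietly launders the teacher's provenance, which is exactly the licensing "oopsie" described above.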
