Models can become extremely good at games, but not at real-world objectives.
Games have an unhackable reward function, because the metric simply is the ground truth.
An RLHF reward signal, by contrast, is only a proxy for real usefulness.
So the model reward-hacks, as any sufficiently strong optimizer eventually will.
Goodhart's law strikes again!
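The proxy-vs-ground-truth dynamic above can be sketched with a toy simulation. Everything here is invented for illustration: a made-up "true usefulness" that only rewards substance, and a made-up proxy reward that also credits flattery. A greedy optimizer climbing the proxy first improves true usefulness, then degrades it.

```python
# Toy Goodhart's-law demo. All functions and numbers are hypothetical,
# not any real RLHF system.

def true_usefulness(substance, flattery):
    # Ground truth: only substance helps; flattery is mildly annoying.
    return substance - 0.1 * flattery

def proxy_reward(substance, flattery):
    # Learned proxy: credits substance only up to a point,
    # and mistakes flattery for quality.
    return min(substance, 5) + 0.5 * flattery

# Greedy hill-climbing on the proxy: each step, spend one unit of
# effort on whichever attribute raises the proxy more.
substance, flattery = 0, 0
history = []
for _ in range(20):
    gain_s = proxy_reward(substance + 1, flattery) - proxy_reward(substance, flattery)
    gain_f = proxy_reward(substance, flattery + 1) - proxy_reward(substance, flattery)
    if gain_s >= gain_f:
        substance += 1
    else:
        flattery += 1
    history.append((proxy_reward(substance, flattery),
                    true_usefulness(substance, flattery)))

# The proxy climbs monotonically, while true usefulness peaks
# once substance stops paying off and flattery takes over.
print(history[-1])  # final (proxy, true) pair
```

The proxy ends at its maximum even though true usefulness has fallen well below its peak along the way: the optimizer did exactly what it was told, and the metric ceased to be a good measure.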
Games are unlike real objectives in that they are inherently artificial constructions: little pockets of reality with precisely defined rules and goals.
If the rules say the player won, they won.
Compare that to the real world, where a business making a ton of profit doesn't mean it was, on net, good for society.