Models can become extremely good at games, but not at real-world objectives.
Games have an unhackable reward function, because the metric simply is the ground truth.
An RLHF reward signal, by contrast, is only a proxy for real usefulness.
So the model reward-hacks, as any sufficiently strong optimizer eventually will.
Goodhart's law strikes again!
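The proxy-vs-ground-truth dynamic above can be sketched with a toy simulation. Everything here is invented for illustration: a made-up "true usefulness" that only rewards substance, and a made-up proxy reward that also credits flattery. A greedy optimizer climbing the proxy first improves true usefulness, then degrades it.

```python
# Toy Goodhart's-law demo. All functions and numbers are hypothetical,
# not any real RLHF system.

def true_usefulness(substance, flattery):
    # Ground truth: only substance helps; flattery is mildly annoying.
    return substance - 0.1 * flattery

def proxy_reward(substance, flattery):
    # Learned proxy: credits substance only up to a point,
    # and mistakes flattery for quality.
    return min(substance, 5) + 0.5 * flattery

# Greedy hill-climbing on the proxy: each step, spend one unit of
# effort on whichever attribute raises the proxy more.
substance, flattery = 0, 0
history = []
for _ in range(20):
    gain_s = proxy_reward(substance + 1, flattery) - proxy_reward(substance, flattery)
    gain_f = proxy_reward(substance, flattery + 1) - proxy_reward(substance, flattery)
    if gain_s >= gain_f:
        substance += 1
    else:
        flattery += 1
    history.append((proxy_reward(substance, flattery),
                    true_usefulness(substance, flattery)))

# The proxy climbs monotonically, while true usefulness peaks
# once substance stops paying off and flattery takes over.
print(history[-1])  # final (proxy, true) pair
```

The proxy ends at its maximum even though true usefulness has fallen well below its peak along the way: the optimizer did exactly what it was told, and the metric ceased to be a good measure.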
Games are unlike real objectives in that they are inherently artificial constructions: little pockets of reality with precisely defined rules and goals.
If the rules say the player won, they won.
Compare that to the real world, where a business making a ton of profit doesn't mean it was, on net, good for society.