"Reward hacking" in models is just a specific example of Goodhart's law.
- "Reward hacking" in models is just a specific example of Goodhart's law.
- If you get a result from a system that you can't understand (that is a black box to you), you can't check to see if it's found a deep real pattern, or something superficial.
- Apparently there was an example where a "tank detector" could reliably tell if a tank was soviet or american… but it turned out it was just because all of the US tanks had been photographed on sunny days.