"Reward hacking" in models is just a specific example of Goodhart's law.

· Bits and Bobs 5/19/25
  • "Reward hacking" in models is just a specific example of Goodhart's law.
    • If you get a result from a system that you can't understand (that is a black box to you), you can't check to see if it's found a deep real pattern, or something superficial.
    • Apparently there was an example where a "tank detector" could reliably tell if a tank was soviet or american… but it turned out it was just because all of the US tanks had been photographed on sunny days.

More on this topic

From other episodes