Agents will optimize for the thing they get evaluated on.
- In any collective of more than one agent, what each agent is evaluated on must be different from the goal of the collective.
- In small, high-trust teams, the agent will be evaluated on the collective's output.
- In large, low-trust teams, the agent will be evaluated on something disjoint from the collective's goal.
- Goodhart's law ("when a measure becomes a target, it ceases to be a good measure") arises from this misalignment between what agents are evaluated on and what the collective needs (see the sketch after this list).
- Agents maximize their own expected value: capping the downside (getting fired) while maximizing the upside (reward).
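
A minimal sketch of the Goodhart dynamic described above, not from the source: the agent count, the effort split, and the `proxy_score` / `collective_goal` functions are illustrative assumptions. Each agent greedily shifts effort toward the metric it is evaluated on ("visible" work), while the collective goal also depends on work the metric ignores ("glue" work).

```python
N_AGENTS = 10
STEPS = 51

# Hypothetical effort split: "visible" work counts toward the evaluation
# metric, "glue" work is invisible to it but needed for the collective goal.
efforts = [{"visible": 0.5, "glue": 0.5} for _ in range(N_AGENTS)]

def proxy_score(effort):
    # What an individual agent is evaluated on: only visible output.
    return effort["visible"]

def collective_goal(all_efforts):
    # What the collective actually needs: both kinds of work,
    # bottlenecked by whichever is scarcer.
    visible = sum(e["visible"] for e in all_efforts)
    glue = sum(e["glue"] for e in all_efforts)
    return min(visible, glue)

for step in range(STEPS):
    if step % 10 == 0:
        avg_proxy = sum(proxy_score(e) for e in efforts) / N_AGENTS
        print(f"step {step:2d}  avg proxy = {avg_proxy:.2f}  "
              f"collective goal = {collective_goal(efforts):.2f}")
    for e in efforts:
        # Greedy local optimization of the proxy: shift effort toward visible work.
        shift = min(0.01, e["glue"])
        e["visible"] += shift
        e["glue"] -= shift
```

Under these assumptions the per-agent proxy climbs steadily to its maximum while the collective goal falls to zero, which is the failure mode the bullets describe.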