I resonated with this argument about focusing on an agent's reliability rather than its capability.
- The gee-whiz demos showcase capability, e.g. "book me a flight."
- But the user value of that capability is highly dependent on its reliability.
- If the automation fails, it often takes more time than it would have if you hadn't used it in the first place.
- You invested time to configure and execute the automation.
- When it fails, you spend additional time and effort diagnosing what went wrong and figuring out how to fix or unwind it.
- You now need to do the task manually anyway.
- Let's analyze a hypothetical use case.
- The use case takes 10 minutes to do manually.
- If the automation works, it takes 5 minutes.
- If the automation fails, the whole use case takes 20 minutes.
- 5 minutes to execute the automation.
- 5 minutes to diagnose the problem.
- 10 minutes to do the task manually.
- The automation has a 60% success rate.
- The expected time of using the automation is 0.6 × 5 + 0.4 × 20 = 11 minutes.
- This is longer than the 10 minutes to just do it yourself.
- The automation is underwater.
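The arithmetic above can be sketched directly, using the numbers from the hypothetical use case (variable names are mine, for illustration):

```python
# Expected time of using the automation vs. the manual baseline.
manual_time = 10    # minutes to do the task by hand
success_time = 5    # minutes when the automation works
failure_time = 20   # minutes when it fails (execute + diagnose + redo manually)
success_rate = 0.60

expected_time = success_rate * success_time + (1 - success_rate) * failure_time
print(expected_time)  # 11.0 minutes -- worse than the 10-minute manual baseline
```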
- Over time, as more people try the automation and fail, they update their priors on its success rate (based on their own past attempts, or on those of friends and other users), and expected use of the underwater automation trends toward zero.
- The three terms that can vary are:
- What percentage of task time is saved if the automation works?
- What percentage of task time is lost if the automation fails?
- What is the success rate?
- The gee-whiz use cases tend to be underwater: they involve many steps, all of which must work correctly in sequence, for the automation to fully work.
- The simple, dependable cases are often viable, and from there you can grow into more and more complex scenarios as the system improves.