It was basically a mindless loop, ripe for being agent-driven:

  - observe error rate uptick
  - maybe dig in with APM tooling
  - read actual error messages
  - compare what APM and logs said to the last commit/deploy
  - if they look even tangentially related, deploy the previous commit (aka revert)
  - if it's still not fixed, do a "debug push": basically stuff a bunch of print statements (or you can do better) around the problem to get more info

I won't say that solves every case, but definitely 90% of them.
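
For what it's worth, the loop above could be sketched roughly like this. Everything here is hypothetical and just for illustration: `Signals`, `looks_related`, `triage`, the `observe`/`revert`/`debug_push` callables and the `uptick_factor` threshold are names I made up, not any real APM or deploy API, and the "related" check is deliberately crude (does a file from the last deploy show up in an error message).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Signals:
    """Hypothetical snapshot of what APM and logs report right now."""
    error_rate: float             # current error rate, e.g. 0.08 for 8%
    baseline: float               # what "normal" looks like
    error_messages: List[str]     # actual messages pulled from logs
    last_deploy_files: List[str]  # files touched by the last commit/deploy


def looks_related(s: Signals) -> bool:
    # "Even tangentially related": any changed file named in any error message.
    return any(f in msg for f in s.last_deploy_files for msg in s.error_messages)


def is_elevated(s: Signals, uptick_factor: float) -> bool:
    return s.error_rate >= s.baseline * uptick_factor


def triage(
    observe: Callable[[], Signals],
    revert: Callable[[], None],
    debug_push: Callable[[], None],
    uptick_factor: float = 2.0,  # assumed threshold, tune to taste
) -> List[str]:
    """Run one pass of the loop and return the steps taken, in order."""
    s = observe()
    if not is_elevated(s, uptick_factor):
        return ["no_action"]

    steps = ["investigate"]
    if looks_related(s):
        revert()                 # deploy the previous commit
        steps.append("revert")
        if not is_elevated(observe(), uptick_factor):
            steps.append("resolved")
            return steps

    # Still broken (or nothing obviously related): ship extra logging.
    debug_push()
    steps.append("debug_push")
    return steps
```

In real use you'd wire `observe`, `revert` and `debug_push` to whatever your monitoring and deploy tooling actually exposes. Doing a debug push even when nothing looks related is my reading of the last step, not something the list spells out.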

I think your point about preserving some amount of intent/context is good, but also, what are most of us doing with agents if not "loop on error message until it goes away"?