It was basically a mindless loop, ripe for being agent-driven:
- observe error rate uptick
- maybe dig in with apm tooling
- read actual error messages
- compare what apm and logs said to last commit/deploy
- if they look even tangentially related, deploy the previous commit (aka revert)
- if it's still not fixed, do a "debug push": basically stuff a bunch of print statements (or something better) around the problem to get more info
I won't say that solves every case but definitely 90% of them.
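To make the loop above concrete, here is a rough Python sketch of it. Everything in it is made up for illustration: `Incident`, `looks_related`, `triage`, and the `revert` / `error_rate_after` / `debug_push` callbacks are hypothetical stand-ins for whatever your APM, log, and deploy tooling actually exposes. The "tangentially related" check is just a naive substring match, and going straight to a debug push when nothing looks related is my assumption, not something the steps above spell out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Incident:
    """Hypothetical snapshot of what APM and logs report after a deploy."""
    error_rate: float           # e.g. 0.07 means 7% of requests failing
    error_messages: List[str]   # raw error messages pulled from logs
    changed_files: List[str]    # files touched by the last commit/deploy


def looks_related(incident: Incident) -> bool:
    # Naive stand-in for "even tangentially related": does any error
    # message mention a file that the last deploy changed?
    return any(
        path in message
        for message in incident.error_messages
        for path in incident.changed_files
    )


def triage(
    incident: Incident,
    threshold: float,
    revert: Callable[[], None],
    error_rate_after: Callable[[], float],
    debug_push: Callable[[], None],
) -> str:
    """Run one pass of the loop and return the action that ended it."""
    if incident.error_rate <= threshold:
        return "ok"  # no uptick, nothing to do

    if looks_related(incident):
        revert()  # deploy the previous commit
        if error_rate_after() <= threshold:
            return "reverted"

    # Assumption: if the revert did not help (or nothing looked related),
    # fall through to a debug push to gather more information.
    debug_push()
    return "debug_push"
```

An agent would sit behind those callbacks, re-running `triage` with fresh data after each debug push instead of a human doing it by hand.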
I think your point about preserving some amount of intent/context is good, but also, what are most of us doing with agents if not "loop on error message until it goes away"?