I think mrkeen is talking about a failure when handling a failure. E.g. when a cancellation step fails, what do you do?
The answer is, you model those as well and work out what to do. But it's messier than you might think if you only model the first-order failure paths.
I disagree, and here's why. There are basically two reasons a cancellation step would fail.
1. A misunderstanding of the business rules. In the flight example, you thought that flights were cancellable, but actually the airline only offers nonrefundable seats.
2. System type errors, e.g. network outages.
If you get a type 1 failure, that's an error that gets ingested by your error monitoring service, and is a bug that needs to be fixed. If you get a type 2 failure, idempotent cancellation (which is necessary for this to work) will eventually get you to your desired state. Either way, you shouldn't need to model deeper into the state graph.
Here's the mind-blowing thing: Temporal handles those type 2 failures for you. So they are the footnote, and then the saga pattern can take up the whole article.
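To make the type 2 case concrete, here is a rough hand-rolled sketch of what "idempotent cancellation plus retries" amounts to. This is not Temporal's API; `TransientError`, `BusinessRuleError`, `cancel_with_retry` and `flaky_cancel` are hypothetical names made up for illustration, and the cancel call is assumed to be idempotent.

```python
import time

class TransientError(Exception):
    """Type 2: network outage, timeout, etc. Safe to retry."""

class BusinessRuleError(Exception):
    """Type 1: e.g. the seat is nonrefundable. A bug to report, not to retry."""

def cancel_with_retry(cancel, booking_id, attempts=5, base_delay=0.0):
    """Call cancel(booking_id) until it succeeds or attempts run out.

    Retrying is only safe because cancel is assumed to be idempotent:
    cancelling an already-cancelled booking must be a no-op.
    BusinessRuleError is deliberately not caught, so it reaches monitoring.
    """
    for attempt in range(attempts):
        try:
            return cancel(booking_id)
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries, surface the outage
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff

# Usage: a fake cancel that fails twice with a network error, then succeeds.
cancelled = set()
calls = {"n": 0}

def flaky_cancel(booking_id):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TransientError("network down")
    cancelled.add(booking_id)  # adding to a set twice changes nothing
    return "cancelled"

result = cancel_with_retry(flaky_cancel, "flight-123")
```

The point of the sketch is that only the transient error is retried; the business rule error falls straight through, which matches the "bug that needs to be fixed" handling described above.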
You should be able to abort everything at any time and still revert to the old state regardless of external service failures. Even if the database went down you have the initial state queued to be restored when it's back up.
Instead of untangling the mess, just cut the Gordian knot and throw a nice error saying what failed and what was aborted.
But in the scenario that the Saga pattern handles, you have at least TWO databases, and multiple processes can be modifying them in the meantime. It IS a Gordian knot and you don't have a known clear place to restore from.
So instead of having complex logic, have a simple lambda function that talks to a queue. That's it. It accepts an undo command: you read a command, you stuff it in the queue, done. No DB, no servers. If you were running this yourself, you would have a simple (distributed) API that does the same against a distributed queue/cache. Done.
Your complex job can now pick up the undo commands from the queue and execute them, with logic to retry if one fails for some reason.
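A minimal sketch of that undo queue, assuming an in-process `queue.Queue` as a stand-in for the distributed queue. The command shape (`action`, `booking_id`, `attempts`) and the names `accept_undo`, `drain`, `cancel_car` and `cancel_flight` are all hypothetical.

```python
import queue

undo_queue = queue.Queue()  # stand-in for a distributed queue

def accept_undo(command):
    """The 'simple lambda': accept an undo command and enqueue it."""
    undo_queue.put(command)

def drain(handlers, max_attempts=3):
    """Worker: run queued undo commands, re-queueing the ones that fail.

    Returns the commands that still failed after max_attempts tries.
    """
    failed = []
    while not undo_queue.empty():
        command = undo_queue.get()
        try:
            handlers[command["action"]](command["booking_id"])
        except Exception:
            command["attempts"] = command.get("attempts", 0) + 1
            if command["attempts"] < max_attempts:
                undo_queue.put(command)  # try again later
            else:
                failed.append(command)   # give up, needs a human
    return failed

# Usage: the car cancellation fails once, then succeeds on retry.
undone = []
state = {"calls": 0}

def cancel_car(booking_id):
    state["calls"] += 1
    if state["calls"] == 1:
        raise ConnectionError("rental API unreachable")
    undone.append(("car", booking_id))

def cancel_flight(booking_id):
    undone.append(("flight", booking_id))

accept_undo({"action": "cancel_car", "booking_id": "car-7"})
accept_undo({"action": "cancel_flight", "booking_id": "flight-123"})
dead_letters = drain({"cancel_car": cancel_car, "cancel_flight": cancel_flight})
```

Note that retried commands go to the back of the queue, so undo order is not preserved here; whether that matters depends on the services involved.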
1. You book a flight. You successfully reserve a seat.
2. You book a car. You successfully reserve a sedan.
3. You try to book a hotel room. The room that you wanted was booked while you were booking your flight, and there aren't any more available.
You obviously don't want the car or flight anymore, and you want to cancel them without a human having to manually fix it.
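The three steps above can be sketched as a tiny saga: each step is paired with a compensating action, and when a step fails, the compensations for the steps that already succeeded run in reverse order. This is only an illustration; `run_saga`, `RoomUnavailable` and the string results are hypothetical, and real compensations would call the airline and rental APIs.

```python
class RoomUnavailable(Exception):
    """Raised when the hotel has no rooms left."""

def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If an action fails, run the compensations of the steps that already
    completed, newest first, and record what was aborted.
    """
    compensations = []
    log = []
    try:
        for action, compensation in steps:
            log.append(action())
            compensations.append(compensation)  # only after the action succeeded
    except Exception as exc:
        for compensation in reversed(compensations):
            log.append(compensation())
        log.append(f"aborted: {exc}")
    return log

def book_hotel():
    raise RoomUnavailable("no rooms left")

log = run_saga([
    (lambda: "flight reserved", lambda: "flight cancelled"),
    (lambda: "car reserved", lambda: "car cancelled"),
    (book_hotel, lambda: "hotel cancelled"),
])
```

Here the hotel step never completed, so its compensation is never registered and only the car and flight get cancelled.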