I've never found myself coding undo actions, because it seems that if a forward ...

anyfoo · on May 30, 2023

"I've never found myself packing a backup parachute, because if one parachute can fail, so can its backup one. And now you've got two things to worry about."

mrkeen · on June 4, 2023

Correct. I stay on the ground.

paulddraper · on May 30, 2023

You must for transactions (like, financial transactions).

Let's say you're making a plane booking service (kiwi.com clone).

You charge the customer, then book the flight. But if booking the flight fails, you must refund the customer.

mrkeen · on June 4, 2023

I'm not saying it's not necessary; I'm saying it's not sufficient.

    pay();
    try {
       book();
    } catch {
       try {
          refund();
       } catch {
          // "must refund the customer" implies we can't reach here
       }
    }

lorendsr · on May 30, 2023

At some point, if you can't automatically fix something, you have to stop and report to a human for manual intervention/repair. While a saga doesn't guarantee that you avoid manual repair, it significantly reduces the need for it. If each of these has a 1% chance of non-retryable failure:

    Step1
    Step2
    Step1Undo

then this has a 1% chance of needing manual repair (it's okay if step1 fails, but if step1 succeeds and step2 fails, we need to repair):

    do Step1
    do Step2

and this has a .01% chance (we only repair if Step2 and Step1Undo both fail, 1% * 1%):

    do Step1
    try {
      do Step2
    } catch {
      do Step1Undo
    }
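The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes each step fails independently with the same probability `p`, and the function names are made up for illustration.

```python
def repair_chance_without_undo(p: float) -> float:
    # Manual repair is needed when Step1 succeeds and Step2 fails.
    return (1 - p) * p


def repair_chance_with_undo(p: float) -> float:
    # Manual repair is needed only when Step1 succeeds, Step2 fails,
    # and Step1Undo fails as well.
    return (1 - p) * p * p


if __name__ == "__main__":
    p = 0.01  # assumed 1% chance of non-retryable failure per step
    print(f"without undo: {repair_chance_without_undo(p):.4%}")
    print(f"with undo:    {repair_chance_with_undo(p):.4%}")
```

Counting the requirement that Step1 succeeds first, the exact figures are about 0.99% and 0.0099%, which round to the 1% and .01% quoted above.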

nivertech · on May 30, 2023

There is also the case when Step1 was successful, but the Saga Orchestrator (or Saga participant in case of Choreography) for some reason (like a communication error) doesn't know about it.

In case Step1's service doesn't expose an API to poll its status, then the only recourse is to execute it again (with the same input key, assuming it's idempotent ;)
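A rough sketch of what "execute it again with the same input key" might look like. `PaymentService`, `charge`, and `charge_with_lost_reply` are hypothetical names, and the in-memory dict stands in for whatever storage a real service would use to remember keys.

```python
import uuid


class PaymentService:
    """Hypothetical in-memory stand-in for Step1's service."""

    def __init__(self) -> None:
        self._results = {}  # idempotency key -> stored receipt
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}-{amount_cents}"
        self._results[idempotency_key] = receipt
        return receipt


def charge_with_lost_reply(service: PaymentService, amount_cents: int) -> str:
    key = str(uuid.uuid4())  # generated once, reused for every attempt
    try:
        service.charge(key, amount_cents)
        # Simulate the reply being lost on the way back to the orchestrator.
        raise ConnectionError("reply lost")
    except ConnectionError:
        # The orchestrator cannot tell whether Step1 ran, so it re-sends
        # the same request with the same key.
        return service.charge(key, amount_cents)
```

The second call returns the stored receipt, so the customer is charged once even though the step was executed twice from the orchestrator's point of view.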

azurelake · on May 30, 2023

There's nothing to debug because a failure during a saga is a totally reasonable and expected thing to happen. Take the example in the article.

1. You book a flight. You successfully reserve a seat.

2. You book a car. You successfully reserve a sedan.

3. You try to book a hotel room. The room that you wanted was booked while you were booking your flight, and there aren't any more available.

You obviously don't want the car or flight anymore, and you want to cancel them without a human having to manually fix it.
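The three steps above could be sketched as a tiny saga runner that undoes completed steps in reverse order. This is an illustration under assumptions, not the article's implementation: `run_saga`, `book_trip`, and the no-op bookings are invented here, and the undo calls are assumed to succeed.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None], Callable[[], None]]  # (name, action, undo)


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _undo = step
        try:
            action()
            log.append(f"done {name}")
            completed.append(step)
        except Exception as exc:
            log.append(f"failed {name}: {exc}")
            for done_name, _action, undo in reversed(completed):
                undo()  # a real system would retry this or flag a human
                log.append(f"undone {done_name}")
            return False
    return True


def no_rooms() -> None:
    raise RuntimeError("no rooms available")


def book_trip() -> List[str]:
    log: List[str] = []
    noop = lambda: None  # placeholder for a real booking or cancellation call
    steps: List[Step] = [
        ("flight", noop, noop),
        ("car", noop, noop),
        ("hotel", no_rooms, noop),
    ]
    run_saga(steps, log)
    return log
```

Calling `book_trip()` records the flight and car as done, the hotel as failed, and then the car and flight as undone, in that order.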

richdougherty · on May 30, 2023

I think mrkeen is talking about a failure when handling a failure. E.g. when a cancellation step fails, what do you do?

The answer is, you model those as well and work out what to do. But it's more messy than you might think if you just model the first-order failure paths.

azurelake · on May 30, 2023

I disagree, and here's why. There are basically two reasons a cancellation step would fail.

1. A misunderstanding of the business rules. In the flight example, you thought that flights were cancellable, but actually the airline only offers nonrefundable seats.

2. System type errors, e.g. network outages.

If you get a type 1 failure, that's an error that gets ingested in your error monitoring service, and is a bug that needs to be fixed. If you get a type 2 failure, idempotent cancellation (which is necessary for this work) will eventually get you to your desired state. Either way, you shouldn't need to model deeper into the state graph.
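One way to picture the two failure types is a retry loop around an idempotent cancel call. Everything named here (`BusinessRuleError`, `TransientError`, `cancel_with_retry`) is hypothetical, and the attempt cap with a "needs_human" fallback is an added assumption rather than something stated above.

```python
import time


class BusinessRuleError(Exception):
    """Type 1: the cancellation is not allowed; retrying will not help."""


class TransientError(Exception):
    """Type 2: network outage or similar; retrying may succeed."""


def cancel_with_retry(cancel, max_attempts: int = 5, base_delay: float = 0.0) -> str:
    """Call an idempotent `cancel` until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            cancel()
            return "cancelled"
        except BusinessRuleError:
            # A bug in our understanding of the rules: report it, do not retry.
            return "needs_bug_fix"
        except TransientError:
            if attempt == max_attempts:
                break
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    return "needs_human"
```

Retrying is only safe because `cancel` is assumed to be idempotent: a cancellation that already went through must be harmless to repeat.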

mrkeen · on May 30, 2023

> If you get a type 2 failure, idempotent cancellation (which is necessary for this work)

That would have been a good article. The saga pattern could have just been a footnote to it.

fortunaTemporal · on May 30, 2023

Here's the mind-blowing thing: Temporal handles those type 2 failures for you. So they are the footnote, and then the saga pattern can take up the whole article.

sublinear · on May 30, 2023

You should be able to abort everything at any time and still revert to the old state regardless of external service failures. Even if the database went down you have the initial state queued to be restored when it's back up.

Instead of untangling the mess, just cut the gordian knot and throw a nice error of what failed and what was aborted.

fortunaTemporal · on May 30, 2023

But in the scenario that the Saga pattern handles, you have at least TWO databases, and multiple processes can be modifying them in the meanwhile. It IS a gordian knot and you don't have a known clear place to restore from.

mrkeen · on May 30, 2023

Example?

segmondy · on May 30, 2023

Your undo operations will be very simple.

So instead of having complex logic, have a simple lambda function that talks to a queue. That's it. It accepts an undo command: you read a command, you stuff it in the queue, done. No DB, no servers. If you were running this yourself, you would have a simple (distributed) API that does the same with a distributed queue/cache. Done.

Your complex job can now pick up the undo commands from the queue and execute with logic to retry if for some reason it fails.
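A single-process sketch of that idea, using Python's standard `queue.Queue` in place of a distributed queue. `enqueue_undo`, `drain_undo_queue`, the command shape, and the `max_attempts` cap are all assumptions made for illustration.

```python
import queue


def enqueue_undo(q: queue.Queue, command: dict) -> None:
    # The "simple lambda": accept an undo command and put it on the queue.
    q.put(command)


def drain_undo_queue(q: queue.Queue, handlers: dict, max_attempts: int = 3) -> list:
    """Execute queued undo commands, re-queueing failures up to max_attempts."""
    dead_letters = []
    while not q.empty():
        command = q.get()
        try:
            handlers[command["type"]](command)
        except Exception:
            command["attempts"] = command.get("attempts", 0) + 1
            if command["attempts"] < max_attempts:
                q.put(command)  # try again later
            else:
                dead_letters.append(command)  # give up and flag for a human
    return dead_letters
```

The intake side stays trivial, while the retry logic lives in the worker that drains the queue. Commands that keep failing end up in `dead_letters` for manual follow-up.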

joesb · on June 1, 2023

It's not completely about handling unplanned failure, but about handling an alternative path when the condition for one path is not met. For example, when you perform `withdrawMoney()` it can fail because there's not enough money in the account. This has nothing to do with a coding failure on your part.

If you have if/else in your code, you don't think "If one path fails, so can the other path, so I never handle the other path".

sublinear · on May 30, 2023

Undo actions are not necessarily about mitigating failure, but just getting to another state in the state machine. Not all failed forward steps are bugs nor always need a retry before an undo.

Regardless of whatever happens, a failed transaction state should always be possible without affecting data integrity.

liampulles · on May 31, 2023

The undo will hopefully catch the slim case when a rollback is needed. If the undo fails (slim slim case) then you flag for a human.

It's just an act of trying to automate the resolution of error scenarios to reduce human effort.

koromak · on May 30, 2023

I think in some situations you have to try.

Billing is one, for example.