An introduction to Sagas
In a simple client-server architecture, a single service can typically do everything required to handle a command. In systems designed with a distributed architecture (to address various needs), a command may require several services on several systems to complete the work. These processes can be long lived, lasting minutes, hours or even days. Some domains even combine manual steps with the electronic processing.
For example, a website that sells merchandise to customers may have orders submitted to an order service while the billing, shipping and emailing responsibilities are distributed across other services.
The saga pattern was created as a way to coordinate these distributed, long-lived, asynchronous activities. A saga is a set of relatively independent transactions. Transactions can execute on multiple systems in sequential order, in parallel or in a combination of both. Sagas orchestrate the state changes as transactions return successful or faulted.
In a saga with a series of sequential transactions, the saga executes the next transaction when the current one returns successful. Each transaction could potentially execute on an entirely different system. When all transactions have returned successful, the saga is considered completed.
Keep in mind that the word transaction in this post does not imply a database transaction. Transactions in a saga could represent any kind of processing the saga needs to coordinate. I flip-flopped on whether I should use the word transaction, but since all the originating material around sagas uses this terminology I decided to stay consistent.
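The sequential flow described above can be sketched roughly as follows. This is a minimal in-memory illustration, not a library API: the class SequentialOrderSagaSketch, its members and the list that stands in for a message bus are all hypothetical, and the command names are borrowed from the ordering example used later in this post.

```csharp
using System.Collections.Generic;

// Hypothetical sketch, not a library API: a saga that issues one command
// at a time and moves on only when the previous transaction succeeded.
public class SequentialOrderSagaSketch
{
    private static readonly string[] Commands =
        { "SubmitOrder", "BillOrder", "ShipOrder" };

    private int nextIndex;

    // Stand-in for a message bus: records the commands the saga has issued.
    public List<string> SentCommands { get; } = new List<string>();

    public bool IsCompleted { get; private set; }

    public void Start()
    {
        SendNext();
    }

    // Called when the transaction behind the last command returns successful.
    public void HandleSuccess()
    {
        if (IsCompleted)
        {
            return;
        }

        if (nextIndex == Commands.Length)
        {
            // Every transaction has returned successful.
            IsCompleted = true;
            return;
        }

        SendNext();
    }

    private void SendNext()
    {
        SentCommands.Add(Commands[nextIndex]);
        nextIndex++;
    }
}
```

Calling Start() issues the first command, and each HandleSuccess() call issues the next one. The saga only reports itself as completed after the success of the last transaction has been handled.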
Sagas fit really well within systems designed with an event-driven architecture. The saga can subscribe to events and handle state changes when the events get published. The saga can issue commands as a result of the events (which in turn could generate more events).
Sagas use the correlation messaging pattern to match events to the saga instance they relate to. The correlation pattern is one of the most fundamental messaging patterns. Correlation is used to track state because, when executing asynchronous processes, we can’t rely on a call stack to hold the state we need.
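To make correlation a little more concrete, here is a minimal sketch that assumes the OrderId carried by an event is the correlation identifier. The types OrderBilled, BillingSagaState and SagaCorrelatorSketch are hypothetical stand-ins written for illustration; a messaging library would normally do this lookup for you.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event type; the OrderId acts as the correlation identifier.
public class OrderBilled
{
    public Guid OrderId { get; set; }
}

// Hypothetical state kept per saga instance instead of on a call stack.
public class BillingSagaState
{
    public Guid OrderId { get; set; }
    public bool CustomerHasBeenBilled { get; set; }
}

public class SagaCorrelatorSketch
{
    private readonly Dictionary<Guid, BillingSagaState> instances =
        new Dictionary<Guid, BillingSagaState>();

    public BillingSagaState StartSaga(Guid orderId)
    {
        var state = new BillingSagaState { OrderId = orderId };
        instances[orderId] = state;
        return state;
    }

    // Routes the event to the saga instance with the same OrderId.
    // Returns false when no instance correlates with the event.
    public bool Handle(OrderBilled message)
    {
        BillingSagaState state;
        if (!instances.TryGetValue(message.OrderId, out state))
        {
            return false;
        }

        state.CustomerHasBeenBilled = true;
        return true;
    }
}
```

Only the instance whose OrderId matches the event changes state. Other running sagas are left untouched, and an event with an unknown identifier is reported as unmatched.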
Faults
Sagas can handle both transaction failures and system faults. If a transaction returns failed, the saga executes compensating actions to undo the changes. Distributed transactions seen in relational databases (SQL Server, Oracle, etc.) have similar abilities to roll back when faulted, but if processing is long lived and spans several databases it’s not feasible to hold a distributed transaction open and lock data on multiple databases for minutes, hours or days. Another thing to consider is that some processing may not occur on a technology that supports the concept of transactions, which means compensating actions must occur within your application architecture.
Distributed transactions still have uses. If the saga or portions of the saga are short lived, a distributed transaction may make sense for rolling back some of the state. Sagas give you the ability to choose to issue compensating actions when distributed transactions don’t make sense.
If there is a need to retry when a saga has failed, one approach is to nest sagas. The parent saga coordinates the retry workflow and handles the case where the nested saga fails. This allows the option to retry the transactions in the nested saga on a different system from the one that originally faulted.
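A rough sketch of that retry idea, under heavy simplification: the parent is reduced to a loop over candidate systems and the nested saga to a delegate that reports success or failure. RetryingParentSagaSketch and its parameters are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: the parent saga retries a nested saga on another
// system when the system it first ran on faults.
public static class RetryingParentSagaSketch
{
    // Returns the name of the system on which the nested saga succeeded,
    // or null when it failed on every candidate system.
    public static string RunNestedSaga(
        IEnumerable<string> candidateSystems,
        Func<string, bool> nestedSaga,
        IList<string> attemptedSystems)
    {
        foreach (string system in candidateSystems)
        {
            attemptedSystems.Add(system);

            if (nestedSaga(system))
            {
                return system;
            }
        }

        return null;
    }
}
```

In a real system the nested saga would run asynchronously and report back through messages. The synchronous delegate here only keeps the retry decision easy to see.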
The example of the ProcessOrder saga shown below is a sequential saga. Compensating actions are created for each transaction in the workflow. If billing the customer fails, both the BillOrder and SubmitOrder compensating actions execute to undo the changes.
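The compensation flow just described could be sketched like this. It is a simplified, synchronous illustration, and SagaStepSketch and CompensatingSagaSketch are hypothetical names. Following the description above, the failed step's own compensating action runs as well as those of the steps before it, in reverse order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pairing of a transaction with its compensating action.
public class SagaStepSketch
{
    public string Name { get; set; }
    public Func<bool> Execute { get; set; }
    public Action Compensate { get; set; }
}

public static class CompensatingSagaSketch
{
    // Runs the steps in order. When a step fails, compensates that step and
    // every earlier step in reverse order, then reports failure.
    public static bool Run(IList<SagaStepSketch> steps, IList<string> log)
    {
        var started = new Stack<SagaStepSketch>();

        foreach (SagaStepSketch step in steps)
        {
            started.Push(step);

            if (step.Execute())
            {
                log.Add(step.Name + " succeeded");
                continue;
            }

            log.Add(step.Name + " failed");

            while (started.Count > 0)
            {
                SagaStepSketch toUndo = started.Pop();
                toUndo.Compensate();
                log.Add("compensated " + toUndo.Name);
            }

            return false;
        }

        return true;
    }
}
```

With SubmitOrder succeeding and BillOrder failing, the log reads: SubmitOrder succeeded, BillOrder failed, compensated BillOrder, compensated SubmitOrder. A ShipOrder step placed after BillOrder never runs.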
Persisting saga state
Handling system faults is also important. Persisting saga state to storage on every state change allows a system that has crashed to recover when it becomes available again. Reloading the state of the saga and processing the unprocessed messages (commands/events) allows the saga to resume where it left off. Both saga state persistence and reliable messaging are important for recovery. If the billing service publishes an OrderBilled success event while the order service is down, it’s important that the saga on the order service can resume when the service is available again, continue the workflow and issue the ShipOrder command to the shipping service. To do this, when the order service is back online it should reload all uncompleted sagas from the persisted storage; since the events will be waiting in the reliable message queue, it can then process them.
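The recovery scenario above might look something like the following sketch. Everything in it is an assumption made for illustration: a Dictionary stands in for durable saga storage, a Queue of order ids stands in for the reliable queue holding OrderBilled events, and a list records the ShipOrder commands that would be sent to the shipping service.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical persisted saga state.
public class PersistedOrderSagaState
{
    public Guid OrderId { get; set; }
    public bool CustomerHasBeenBilled { get; set; }
    public bool Completed { get; set; }
}

// Hypothetical order service that resumes sagas after a restart.
public class RecoveringOrderServiceSketch
{
    // Stand-in for durable saga storage that survives a crash.
    private readonly Dictionary<Guid, PersistedOrderSagaState> sagaStore;

    // Stand-in for a reliable queue of OrderBilled events (order ids).
    private readonly Queue<Guid> orderBilledQueue;

    // Stand-in for ShipOrder commands sent to the shipping service.
    public List<Guid> ShipOrderCommands { get; } = new List<Guid>();

    public RecoveringOrderServiceSketch(
        Dictionary<Guid, PersistedOrderSagaState> sagaStore,
        Queue<Guid> orderBilledQueue)
    {
        this.sagaStore = sagaStore;
        this.orderBilledQueue = orderBilledQueue;
    }

    // Called on startup: drains events that arrived while the service was down.
    public void ProcessPendingEvents()
    {
        while (orderBilledQueue.Count > 0)
        {
            Guid orderId = orderBilledQueue.Dequeue();

            PersistedOrderSagaState saga;
            if (!sagaStore.TryGetValue(orderId, out saga) || saga.Completed)
            {
                continue; // no uncompleted saga correlates with this event
            }

            saga.CustomerHasBeenBilled = true;
            sagaStore[orderId] = saga; // persist the state change
            ShipOrderCommands.Add(orderId);
        }
    }
}
```

Because the store and the queue are created outside the service object, a new service instance built after a simulated crash picks up exactly where the old one left off.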
Example
There are many different ways to implement a saga. For the purpose of this post I’ll be using a saga example inspired by NServiceBus that handles a sequential ordering process.
The saga needs a class to store the data that will get persisted (to a document or relational database, for example) so that the saga can resume even if the service is restarted or, in the worst case, crashes.
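// Namespaces assume the older NServiceBus saga API this example is modelled on.
using System;
using System.Collections.Generic;
using NServiceBus;
using NServiceBus.Saga;

public class ProcessOrderSagaData : ISagaEntity
{
    public virtual Guid Id { get; set; }
    public virtual string Originator { get; set; }
    public virtual Guid OrderId { get; set; }
    public virtual Guid CustomerId { get; set; }
    public virtual List<Guid> ProductIdsInOrder { get; set; }
    public virtual bool CustomerHasBeenBilled { get; set; }
}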
Now that we have a class to store the state for our saga we can create the saga itself.
public class ProcessOrderSaga :
    Saga<ProcessOrderSagaData>,
    ISagaStartedBy<OrderAccepted>,
    ISagaStartedBy<AcceptOrderFailed>,
    IHandleMessages<CustomerBilledForOrder>
{
    // Starts the saga: remember the order details and ask billing to charge the customer.
    public void Handle(OrderAccepted message)
    {
        this.Data.ProductIdsInOrder = message.ProductIdsInOrder;
        this.Data.CustomerId = message.CustomerId;
        this.Data.OrderId = message.OrderId;

        this.Bus.Send<BillCustomerForOrder>(m =>
        {
            m.CustomerId = this.Data.CustomerId;
            m.OrderId = this.Data.OrderId;
        });
    }

    // Compensating action: the order could not be accepted, so cancel it.
    public void Handle(AcceptOrderFailed message)
    {
        // This message can start the saga, in which case Data.OrderId is not
        // populated yet, so read the id from the message itself
        // (assumes AcceptOrderFailed carries an OrderId).
        this.Bus.Send<CancelCustomerOrder>(m =>
        {
            m.OrderId = message.OrderId;
        });
    }

    public void Handle(CustomerBilledForOrder message)
    {
        this.Data.CustomerHasBeenBilled = true;
        this.Data.CustomerId = message.CustomerId;
        this.Data.OrderId = message.OrderId;

        this.CompleteIfPossible();
    }

    // Ships the order and completes the saga once the order is known and billed.
    private void CompleteIfPossible()
    {
        if (this.Data.ProductIdsInOrder != null &&
            this.Data.CustomerHasBeenBilled)
        {
            this.Bus.Send<ShipOrderToCustomer>(m =>
            {
                m.CustomerId = this.Data.CustomerId;
                m.OrderId = this.Data.OrderId;
                m.ProductIdsInOrder = this.Data.ProductIdsInOrder;
            });

            this.MarkAsComplete();
        }
    }
}
The ISagaStartedBy<T> interface provides a mechanism for NServiceBus to detect when this saga should be created and started. When an event of type OrderAccepted or AcceptOrderFailed is received, the saga will begin. The saga is subscribed to the events you see in the Handle() methods. If the AcceptOrderFailed event is published from the order service, the saga will issue the CancelCustomerOrder command. I didn’t implement all the compensating actions in the example, but similar compensating commands would exist for the other failure events.
Other uses
The saga pattern was designed for long-lived processes that execute transactions on distributed systems, but the pattern itself is helpful for more than just those situations. The saga pattern gives the developer an elegant API even when some of the original goals of the pattern (fault tolerance, distribution, long-lived processes) aren’t needed. Composite applications that use an event aggregator for messaging between decoupled components within the same process have needs similar to those of distributed systems that communicate with messaging.
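As an illustration of the in-process case, here is a minimal sketch of a saga-like class driven by an event aggregator. EventAggregatorSketch is a hypothetical, bare-bones aggregator written for this example rather than one taken from a specific library, and the two event classes are empty stand-ins that only reuse names from the ordering example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal event aggregator for in-process publish/subscribe.
public class EventAggregatorSketch
{
    private readonly Dictionary<Type, List<Action<object>>> handlers =
        new Dictionary<Type, List<Action<object>>>();

    public void Subscribe<T>(Action<T> handler)
    {
        List<Action<object>> list;
        if (!handlers.TryGetValue(typeof(T), out list))
        {
            list = new List<Action<object>>();
            handlers[typeof(T)] = list;
        }

        list.Add(e => handler((T)e));
    }

    public void Publish<T>(T message)
    {
        List<Action<object>> list;
        if (handlers.TryGetValue(typeof(T), out list))
        {
            // Copy first so handlers may subscribe while we iterate.
            foreach (Action<object> handler in list.ToArray())
            {
                handler(message);
            }
        }
    }
}

// Empty stand-ins for events raised by decoupled components.
public class OrderAccepted { }
public class CustomerBilledForOrder { }

// Tracks workflow state across events, all within a single process.
public class InProcessOrderSagaSketch
{
    private bool orderAccepted;
    private bool customerBilled;

    public bool ReadyToShip
    {
        get { return orderAccepted && customerBilled; }
    }

    public InProcessOrderSagaSketch(EventAggregatorSketch events)
    {
        events.Subscribe<OrderAccepted>(e => orderAccepted = true);
        events.Subscribe<CustomerBilledForOrder>(e => customerBilled = true);
    }
}
```

Nothing here is distributed, persisted or long lived, yet the class still does the saga's core job of collecting state from independent events until the workflow can move on.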
I could be convinced the name saga doesn’t make sense for some of these uses, but the fact is the saga classes available in various libraries are still useful in situations beyond the original intent of long-lived transactions.
Look out for more posts on sagas. I plan to show more examples and scenarios in which I have found them useful.
Conclusion
Sagas allow us to coordinate complex asynchronous workflows that potentially run on distributed services. Event-driven architecture provides a natural publish/subscribe communication model for the saga. Compensating actions allow us to undo changes when any of the processing fails, and persisting saga state allows us to be fault tolerant and support long-lived processes.



