Sagas, Event Sourcing, and Failed Commands

There is an interesting thread on the DDD/CQRS group involving sagas. I'm posting my reply below because I want to make it available as a solution to those who encounter this same problem. Please direct all replies to the CQRS thread rather than replying here.

To give a small amount of background, the question is related to how a failure or refusal on the part of the domain to perform some action is communicated back to the saga such that it can take the appropriate, compensating action.

Here's the post:

We've run into almost this exact situation and considered a number of ways to solve the problem.

To sum up the problem, it's that we need to communicate a failure or rejection of a command/instruction by the domain back to a saga such that the saga can take the appropriate, compensating action.

As previously mentioned in this thread, an aggregate must enforce its invariants and is not allowed to enter an invalid state. This presents somewhat of an issue because we want to communicate the failure back to the saga but this failure cannot be communicated *through* the domain.

One potential solution (although not the one that we use) was raised during Greg's course when he was discussing sagas. He hinted at the idea that a call from the saga to the domain could be made via RPC. In essence, the command message is sent synchronously via some kind of RPC-like call or web service instead of asynchronously using a message bus or message queue. In this way the failure can easily be communicated back in the form of a fault that is understood by the saga. Ultimately we went with something a little bit different because we were exposing our domain behavior through message handlers that listened to a message bus and we wanted to maintain asynchronous messaging throughout our system.

Our solution was the following (and it's worked quite well):

1. The saga asynchronously dispatches a command message to the domain.

2. The message handler receives the command message, loads the aggregate from the repository, and calls the appropriate method on the aggregate.

3. The domain object checks its invariants and, when it must refuse the action, throws an intention-revealing, well-named, domain-specific exception.

4. The command handler catches the well-named exception. Because the command handler knows the type of message received along with its intention and the type of exception that was thrown and caught, it's in a position to relay that exception in the form of a message back to the saga via a bus.Reply(). But this time, it's not an event message, which represents something that happened. Instead it's a message that describes something that didn't happen. We didn't have a name for this kind of message but the terms notification and alert kept coming back. Ultimately we decided to call these kinds of messages alerts. These messages could also potentially be called faults or something else, but they should be considered separate from events.

5. The saga receives the alert/fault message and applies the appropriate action.
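The steps above can be sketched roughly as follows. This is only an illustration in Python, not our actual code: the withdrawal scenario, the names WithdrawFunds, WithdrawalRejected, InsufficientFunds, Account and InMemoryBus, and the reply() method standing in for bus.Reply() are all made up for the example.

```python
from dataclasses import dataclass


class InsufficientFunds(Exception):
    """Hypothetical domain-specific exception raised when the aggregate refuses."""


@dataclass
class WithdrawFunds:
    """Hypothetical command dispatched by the saga."""
    account_id: str
    amount: int


@dataclass
class WithdrawalRejected:
    """Hypothetical alert: describes something that did NOT happen."""
    account_id: str
    amount: int
    reason: str


class Account:
    """Hypothetical aggregate that enforces its own invariant."""

    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance

    def withdraw(self, amount):
        # The aggregate never enters an invalid state; it refuses instead.
        if amount > self.balance:
            raise InsufficientFunds(f"balance {self.balance} is less than {amount}")
        self.balance -= amount


class InMemoryBus:
    """Stand-in for a real bus; reply() plays the role of bus.Reply()."""

    def __init__(self):
        self.replies = []

    def reply(self, message):
        self.replies.append(message)


def handle_withdraw_funds(command, repository, bus):
    # Step 2: load the aggregate and call the appropriate method.
    account = repository[command.account_id]
    try:
        account.withdraw(command.amount)
    except InsufficientFunds as error:
        # Step 4: translate the domain exception into an alert for the saga.
        bus.reply(WithdrawalRejected(command.account_id, command.amount, str(error)))
```

Note that the alert is created by the handler, the layer just on top of the domain, and never by the aggregate itself. The aggregate's state is left untouched when it refuses.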

The one question that still remains is that we lose the alert message once it's consumed by the saga, and the concern is that we want to keep track of everything that has happened. I couldn't agree more. Keeping track of what's happened is extremely important. But a better question is who's responsible for keeping track of this fault/alert? Should the domain keep track of something that didn't happen or something that it refused to do? Isn't it the saga that cares about the command and associated failure? Shouldn't the saga (or its infrastructure) be responsible for tracking all messages relative to the saga?

In our implementation, we actually implement our sagas using event sourcing. What this means is that all messages addressed to the saga are stored and replayed in order to rebuild the state of the saga. In addition (and as an auditing benefit), we use an event store to dispatch outbound messages from the saga. This means that a saga is completely autonomous and separate from the domain and is only coupled by the message contracts. It also means that we can replay incoming messages against a new implementation of a saga if necessary to come up with an alternate model as our sagas and domain evolve to meet changing business requirements.
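As a rough sketch of what an event-sourced saga might look like (again in Python, with hypothetical names such as OrderSaga, SagaStore, OrderPlaced, WithdrawalRejected and CancelOrder that are assumptions for the example only), every incoming message, alerts included, is appended to the saga's stream, and the saga's state is rebuilt by replaying that stream:

```python
from dataclasses import dataclass


@dataclass
class OrderPlaced:
    """Hypothetical event received by the saga."""
    order_id: str


@dataclass
class WithdrawalRejected:
    """Hypothetical alert received by the saga."""
    order_id: str
    reason: str


@dataclass
class CancelOrder:
    """Hypothetical compensating command produced by the saga."""
    order_id: str


class OrderSaga:
    """Hypothetical saga whose state is derived only from the messages it handles."""

    def __init__(self):
        self.state = "new"
        self.outbound = []

    def handle(self, message):
        if isinstance(message, OrderPlaced):
            self.state = "awaiting_payment"
        elif isinstance(message, WithdrawalRejected):
            # The alert triggers the compensating action.
            self.state = "cancelled"
            self.outbound.append(CancelOrder(message.order_id))


class SagaStore:
    """Hypothetical in-memory store keeping every message addressed to a saga."""

    def __init__(self):
        self.streams = {}

    def append(self, saga_id, message):
        self.streams.setdefault(saga_id, []).append(message)

    def load(self, saga_id, saga_factory=OrderSaga):
        # Rebuild the saga by replaying its full message history in order.
        saga = saga_factory()
        for message in self.streams.get(saga_id, []):
            saga.handle(message)
        return saga
```

Because load() accepts a saga_factory, the same stored history can be replayed against a different saga implementation. This sketch recomputes outbound messages on every replay; a real implementation would need to record which outbound messages were already dispatched so they are not sent twice.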

To sum things up:

1. When the domain refuses to do something, it doesn't generate an event. Instead, it throws an exception.

2. The command handler handles the exception and does a bus.Reply() with an "alert" or "fault" message.

3. Because the message is not generated by the domain, but by the layer just on top of the domain, we don't track this alert/fault message inside of the domain.

4. The saga receives the message and takes appropriate, compensating action.

5. Because the saga is implemented using event sourcing, nothing is ever lost. We have a complete business/audit history of what happened and we can evolve our saga model and rebuild it with full confidence in our message history.