DDDD: Distributing A Single Bounded Context

In a previous post, I detailed all of the reasons why processing messages one by one with an aggregate root was a good thing. I am now going to contradict some of the main points in that post. While certain aspects still hold true, there is a much larger matter involved.

First, the points that I still consider to be valid:

  1. Don't have threading logic in your aggregate roots. Don't do things with locks, monitors, mutexes, etc. Your domain objects should run on a single thread at a time. They're going to be complex enough as it is with business concerns; why add so much incidental complexity related to threading and infrastructure to cloud their intent?
  2. Don't use synchronous calls to services within your aggregates. Send a message to the service. The service will, at some point in the future, publish a message asynchronously back to the aggregate, and the aggregate will continue processing (see the sketch after this list).
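
To make the second point concrete, here is a minimal sketch. Everything in it is illustrative: the bus, the PricingRequested and PriceCalculated messages, and the Order aggregate are assumed names, not a prescribed design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PricingRequested:
    """Message the aggregate sends to the pricing service."""
    order_id: str
    sku: str

@dataclass(frozen=True)
class PriceCalculated:
    """Asynchronous reply the service publishes some time later."""
    order_id: str
    amount: float

class Order:
    """Aggregate root: no locks, no synchronous service calls.
    It only sends and receives messages, one thread at a time."""
    def __init__(self, order_id, bus):
        self.order_id = order_id
        self.bus = bus          # bus.publish(message) is an assumed API
        self.price = None

    def request_pricing(self, sku):
        # Instead of calling the service directly, send a message
        # and return immediately.
        self.bus.publish(PricingRequested(self.order_id, sku))

    def handle(self, message):
        # When the service's reply eventually arrives, the aggregate
        # picks up where it left off.
        if isinstance(message, PriceCalculated):
            self.price = message.amount
```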

While messaging patterns easily allow us to distribute the various bounded contexts (subsystems) across multiple machines, how do we handle splitting a single bounded context? My initial understanding was to use some kind of partitioning strategy with advanced routing mechanisms. That kind of strategy may be overkill for most scenarios.
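
For what it's worth, the routing half of such a partitioning strategy can be tiny: hash each aggregate's identity to pick the machine that owns it. The node names below are hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical machine names

def route(aggregate_id: str) -> str:
    """Deterministically map an aggregate to one owning node so that
    every command for that aggregate lands on the same machine."""
    digest = hashlib.sha1(aggregate_id.encode()).digest()
    return NODES[int.from_bytes(digest[:4], "big") % len(NODES)]
```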

The simple answer is optimistic concurrency—sort of. Optimistic concurrency tells us that, while we were working with a copy of an object, the original object has changed underneath us. Greg has mentioned IConflictWith as a way to solve this issue. But there's still something missing. IConflictWith is only designed to deal with *commands* that you send to your domain. When a domain object receives two commands that conflict, it may ultimately reject one or both. Events are another matter. Events happened and must be dealt with—they cannot be rejected.
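
A rough sketch of that distinction follows. The conflicts_with predicate is written in the spirit of IConflictWith, but the names and the naive conflict rule here are mine, not Greg's API.

```python
class ConcurrencyError(Exception):
    pass

def conflicts_with(ours, theirs) -> bool:
    # Domain-specific rule in the spirit of IConflictWith; as a naive
    # placeholder, treat two changes of the same type as a conflict.
    return type(ours) is type(theirs)

def try_commit(our_events, committed_since_load):
    """Optimistic concurrency: 'committed_since_load' changed the object
    while we worked on a stale copy. Our command's events may be
    rejected; events that already happened can only be applied."""
    for theirs in committed_since_load:
        for ours in our_events:
            if conflicts_with(ours, theirs):
                # A conflicting *command* can be rejected outright.
                raise ConcurrencyError("command rejected; reload and retry")
    # No real conflict: append our events after the newer ones.
    return our_events
```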

In a single machine/context environment, things are easy. In fact, it's the simplest thing that could possibly work. The problems you may ultimately encounter in this scenario include fault tolerance (a single point of failure) and the limits of vertical scaling, to name a few. For bounded contexts in your domain that require a timely response due to strict SLAs, you absolutely must have some kind of fault tolerance. This requires distributing a single bounded context such that it runs on multiple machines simultaneously.

The big problem with running a single bounded context on multiple machines relates to consistency within aggregates. If an aggregate can run on machine A and machine B simultaneously, who's right? Who is "consistent"? They both are. By using messages, we send all appropriate events to the other instance of the bounded context so that it can be made aware of what happened in our instance. But that introduces yet another problem: what happens if they can't communicate?
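
That exchange might look like the sketch below, where each instance records its own events and forwards them to its peers. The class and method names are assumptions for illustration.

```python
class BoundedContextInstance:
    """One copy of the bounded context; peers exchange events so each
    instance learns what the others have done."""
    def __init__(self, name):
        self.name = name
        self.peers = []
        self.log = []   # every event this instance knows about

    def record(self, event):
        # An aggregate here did something: store the event locally,
        # then tell every other instance about it.
        self.log.append(event)
        for peer in self.peers:
            peer.receive(event)

    def receive(self, event):
        # Apply a peer's event so our copy of the aggregate catches up.
        self.log.append(event)

a, b = BoundedContextInstance("A"), BoundedContextInstance("B")
a.peers, b.peers = [b], [a]
a.record("ItemReserved: order-1")   # B now knows what happened on A
```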

If you have two machines, you can virtually guarantee that the network between them will be unavailable at some point in time. But if the network is unavailable, how do we ensure that the events between copies of the same aggregate are handled by the other copies quickly? How can each copy be made aware of the events from the other copies? They can't. At least, not until network connectivity is restored.
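
One way to bridge that window is store-and-forward: buffer outgoing events durably and drain the buffer once connectivity returns. A minimal sketch, where connected and send are placeholders for a real transport:

```python
from collections import deque

class Outbox:
    """Buffer events while the network is down; flush on reconnect.
    connected() and send() stand in for a real durable transport."""
    def __init__(self, connected, send):
        self.pending = deque()
        self.connected = connected
        self.send = send

    def publish(self, event):
        self.pending.append(event)   # never drop an event
        self.flush()

    def flush(self):
        # Call this again when connectivity is restored; the other
        # copies only catch up once the link between them is back.
        while self.pending and self.connected():
            self.send(self.pending.popleft())
```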

Suppose you have geo-distributed your application. One critical bounded context of the application is running on a server in Los Angeles, and another copy of that same bounded context is running in New York. If for some reason the network between them goes down, each piece will continue to operate and service requests. Whether the network is down for a few minutes, hours, or possibly even days, each piece can continue to process requests and remain available.

In the above scenario, what are the chances that each piece has performed some work that was also performed by the other? What are the chances that each copy has started to diverge from the other? The chances are pretty good, especially as more time passes. How do we handle this? How do we reconcile the conflicts between two copies of the same aggregate object? Compensating actions.

At some point in the messaging pipeline, perhaps as late as the reporting bounded context, something will detect the anomaly showing that these two aggregate instances have produced conflicting results. Whoever detects the anomaly sends a message back into the aggregate letting it know of the conflicted state. The aggregate then takes the appropriate compensating action to reconcile the differences.
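
A sketch of that loop might look like this; every name here is illustrative, and the compensating action itself would be a business decision, not a technical one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConflictDetected:
    """Message sent back into the domain by whoever spots the anomaly,
    e.g. the reporting bounded context."""
    order_id: str
    ours: str     # what this instance decided
    theirs: str   # what the other instance decided

class Order:
    """Aggregate that owns the reconciliation decision."""
    def __init__(self, order_id):
        self.order_id = order_id
        self.uncommitted_events = []

    def handle_conflict(self, message: ConflictDetected):
        # The aggregate, not the detector, decides how to compensate:
        # e.g. cancel the duplicate reservation and notify the customer.
        self.uncommitted_events.append(
            ("ReservationCancelled", message.order_id, message.theirs))
```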