In my previous blog post, I elaborated upon Greg's devTeach talk in which he hints at distributing a bounded context. I talked about how using what Greg calls a "locator" can facilitate distribution without introducing concurrency problems.
The question was asked in his talk at 46:40, "Is the locator a single point of failure?" Because his talk was not actually about distributed systems, but rather about CQS and some foundational aspects of DDDD, Greg pulled back from that question and moved on.
Here's the answer: No, it's not a single point of failure.
If you had a single instance of a "locator", yes, it would be. But if you had multiple instances, then you'd be okay. Great, now we reintroduced a whole mess of concurrency issues by having multiple locators.
Enterprise Integration Patterns
Before we dig into solving concurrency between each instance of the locator, we should probably use the appropriate name for the solution. In Gregor Hohpe's book, Enterprise Integration Patterns, he talks about a pattern he calls a Content-based Router. The "locator", per Greg's talk, seems to be content-based router. The word "locator" to me sounds much more request/response-oriented, while a content-based router sounds much more like its Ethernet counterpart, which forwards messages (or packets) to an endpoint. Therefore, rather than calling it a "locator", I will simply refer to it as a content-based router or "router" for short.
Semi-stateful Routing
The content-based router could either be implemented with hard-coded rules (stateless), or it could function in a much more dynamic fashion (stateful).
In Greg's devTeach talk, he talked about how the router would be aware of CPU time and available memory on each node to which it could forward a command. In his interview with Eric Evans, he talked about ensuring that no two ticker symbols in his stock trading system would live on the exact same node—thus drastically reducing the probability that multiple node failures would bring down any ticker symbol.
From a performance standpoint, the router makes all of its decisions based upon information available in memory, rather than using any kind of external service call or database. In this manner, the router would perform as if it was completely stateless, while in fact, it remembers on which node a particular aggregate root lives.
Single Point of Failure
The router has the responsibility of determining the node to which commands will be sent and remembering that information. By itself, it's a single point of failure. To avoid this, we simply apply the same pub/sub model to the router. Whenever a decision is made regarding routing tables, we publish that information to other instances of the router. In this manner, we avoid the router being a single point of failure.
Concurrency Revisited
If we have multiple routers, don't we have to manage concurrency between them? Yes, if they process messages simultaneously. But if we only have one instance of the router active at a time and listening for messages, then we don't have a problem.
How do we ensure that one and only one router is listening for messages to forward?
Stayin' Alive / Another One Bites The Dust
At this point, we should introduce an infrastructure message, the heartbeat message. At a specified interval each instance of a bounded context AND each router would publish a heartbeat message and publish it to each router saying "I'm alive." We can use this message and embed a few other tidbits of information—including available memory or CPU time, etc.
This message serves two purposes. First, if a particular node misses X hearbeats, we know that the node has failed and we can re-direct the messages to another node. We then notify all other routers of the same so that they know of the change. [Aside: We also need to figure out how to deal with messages that have already been sent to the failed node.]
The same goes for the routers themselves. Only one router is actively forwarding messages while the others are idling, listening to state changes from the active router and heartbeats from the nodes. When the active router misses X heartbeats, the remaining routers elect a new leader. (A super-easy formula for leader election is to use the one with highest fixed value, e.g. IP address, or # of CPUs, or # of GB of RAM, etc.)
Once the new leader is elected, it takes over and starts processing messages and forwards them to the appropriate node. When the failed router comes back online, it sees that another router is actively processing (probably through the heartbeat message from the newly elected primary router) and assumes its place in idle mode. [Aside: How would it get the list of routes from the primary router? Or is that information truly necessary?]
When either a new or failed node comes online, it should publish a message to all routers as if it was a brand-new node because it probably doesn't have any in-memory state any more. Or even if it did, the routers may have redirected messages elsewhere.
Conclusion
Message-based systems can be exponentially more complex to orchestrate than simple request/reply models, but it's because they're exponentially more scalable. Fortunately this complexity resides outside the domain and is in separate, reusable, infrastructure components so that you can grow into this solution as necessary.
As a final thought, I would recommend that you have a set of routers for each bounded context that needs concurrency.