Removing 2PC (Two-Phase Commit)

I received the following email today and I thought I'd answer it as a blog post so that all can benefit:

If you remove the 2PC from the system, how do you deal with ensuring that published events are:

* truly published,

* not lost,

* that there is confidence that the interested parties (subscribers) are truly receiving and processing the events in a 'timely' manner

* and that there is confidence in, and a method for, the system gracefully recovering from unexpected situations?

This is an area that seems to be glossed over quite a bit in the talks and sample code.

Most of the time, the 'talks' say, "Throw it in a durable queue, and then it's there (easy-peasy)" and sample code uses an in-memory synchronous stream of registered method calls to have a lightweight bus.

How do devs handle this in the 'real world'?

Perhaps I can ask you this question - how do you normally implement your event processing logic and ensure that, for example, read models are updated properly, even when db connections go down, without using 2PC?

Here is my answer:

In the EventStore, here's how I handle publishing without 2PC:

  1. When a batch of events is received, it is durably stored to disk with a "Dispatched" flag of false, meaning the events haven't been published yet.
  2. The batch of events, which I call a "commit", is then pushed to a "dispatcher" (like NServiceBus) on a different thread.
  3. The dispatcher publishes all of the events in the commit and commits its own transaction against the queue.
  4. The dispatcher marks the batch of events/commit as dispatched.

If at any point the dispatch fails, the commit remains marked as undispatched, and when the system comes back online it will immediately publish those events. Yes, this introduces a slight possibility that a message might be published more than once, which is why we need to de-duplicate when handling messages in the read models.
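To make the mechanics concrete, here is a minimal sketch of the store-then-dispatch steps above, using Python and SQLite. The table, column, and function names are illustrative stand-ins, not the EventStore's actual schema or API:

```python
import sqlite3

# Durable storage for commits, each carrying a "dispatched" flag.
db = sqlite3.connect("events.db")
with db:
    db.execute("""CREATE TABLE IF NOT EXISTS commits (
                      commit_id  TEXT PRIMARY KEY,
                      payload    TEXT NOT NULL,
                      dispatched INTEGER NOT NULL DEFAULT 0)""")

def store_commit(commit_id, payload):
    # Step 1: durably store the commit with its dispatched flag set to false.
    with db:
        db.execute("INSERT INTO commits (commit_id, payload) VALUES (?, ?)",
                   (commit_id, payload))

def publish(payload):
    # Stand-in for handing the commit to a real dispatcher/bus (e.g.
    # NServiceBus), which would commit its own transaction against the queue.
    print("published:", payload)

def dispatch_pending():
    # Steps 2-4, also run at startup: publish every commit not yet marked as
    # dispatched, then flip the flag. A crash between publish() and the
    # UPDATE means the commit gets published again later -- at-least-once
    # delivery, hence the de-duplication on the read-model side.
    pending = db.execute(
        "SELECT commit_id, payload FROM commits WHERE dispatched = 0").fetchall()
    for commit_id, payload in pending:
        publish(payload)
        with db:
            db.execute("UPDATE commits SET dispatched = 1 WHERE commit_id = ?",
                       (commit_id,))
```

The key design point is that the dispatched flag is only flipped *after* a successful publish, so a failure anywhere in the pipeline errs toward republishing rather than losing events.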

As far as updating a read model without 2PC, here are the general steps:

  1. Receive the event and update your view models accordingly.
  2. As part of the same database transaction, record the unique identifier for the message into some kind of "I've handled this message already" table.
  3. Commit the database transaction.

If that message is ever received again, your system will try to insert its identifier into the table of handled message ids, but the transaction will fail because you've already handled the message previously. Simply catch this duplicate-insert exception and drop the message. Or you could be more proactive and check whether the message id is already in the table before you update anything. But duplicates should occur so seldom that I would recommend the more reactive approach, as sketched below.
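In code, the reactive approach looks something like the following sketch, again assuming a relational store (SQLite here) and hypothetical table and handler names:

```python
import sqlite3

db = sqlite3.connect("readmodel.db")
with db:
    # The de-duplication table; the primary key is what makes redelivery fail.
    db.execute("""CREATE TABLE IF NOT EXISTS handled_messages (
                      message_id TEXT PRIMARY KEY,
                      handled_at TEXT NOT NULL DEFAULT (datetime('now')))""")
    # A toy view model: account balances.
    db.execute("""CREATE TABLE IF NOT EXISTS balances (
                      account_id TEXT PRIMARY KEY,
                      balance    INTEGER NOT NULL DEFAULT 0)""")

def handle_deposited(message_id, account_id, amount):
    try:
        with db:  # steps 1-3 in a single transaction
            # Step 1: apply the event to the view model.
            # (The upsert syntax requires SQLite 3.24+.)
            db.execute("""INSERT INTO balances (account_id, balance)
                          VALUES (?, ?)
                          ON CONFLICT(account_id)
                          DO UPDATE SET balance = balance + ?""",
                       (account_id, amount, amount))
            # Step 2: record the message id in the same transaction.
            db.execute("INSERT INTO handled_messages (message_id) VALUES (?)",
                       (message_id,))
        # Step 3: the with-block commits both writes atomically.
    except sqlite3.IntegrityError:
        # Duplicate insert: this message was already handled, so drop it.
        pass
```

Calling handle_deposited() twice with the same message_id applies the deposit only once; the second call fails on the primary key, rolls back the whole transaction, and is silently dropped.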

One question that is too often overlooked is how long to keep those identifiers around before removing them. The answer depends upon the nature of your system and how long the possibility of duplicate messages exists. You could easily set up an automated/scheduled task to clean out identifiers that have been in the table for, say, a few days or even a week or more.
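Continuing the sketch above, such a cleanup job could be as simple as this; the one-week window is purely illustrative:

```python
def purge_old_message_ids(db):
    # Run from a scheduled task. Pick a retention period comfortably longer
    # than the window in which duplicates can plausibly arrive.
    with db:
        db.execute("DELETE FROM handled_messages "
                   "WHERE handled_at < datetime('now', '-7 days')")
```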

If you're using a storage engine for your view models that doesn't support traditional transactions, e.g. a NoSQL store, then you have to be a bit more creative about how you de-duplicate messages, but that's another topic for another day.