NServiceBus Distributed Transaction Woes

In the industry there is a general trend away from locking, two-phase commits. In theory, quorum and Paxos-style commits ensure consistency across disparate resources, but why not avoid the issue altogether? The fundamental idea behind each of these commit protocols is to ensure application consistency at an infrastructure level. Rather than relying on low-level protocols to ensure consistency, why not embrace inconsistency and handle it explicitly within our application? In this way we can avoid anything but simple, "single phase" commits against a single transactional resource such as a database or message queue.

Okay, hold on, stop the press. Don't we have to wrap our transaction around both the message queue and database? The general rule for NServiceBus is…yes. But as we can see, there are problems and difficulties getting NServiceBus to place nicely with two-phase, distributed transactions. To be fair, it's not actually NServiceBus, it's subpar database drivers and configuration overhead related to MSDTC.

A while back we spent some time trying to get NServiceBus to play nicely in a two-phase distributed transaction against various databases. We wanted to try out databases other than SQL Server and Oracle both of which have excellent drivers, tools, paid and community support, but would require us to mortgage our children and grandchildren. With that in mind we tried out MySQL and PostgreSQL. We ran each database server in both Windows and Linux both locally and remotely and we tested various drivers—both the "official" drivers as well as available third-party free and paid drivers related to each database. We created a test harness to exercise each driver to flush out specific failure conditions we might find in production.

Our test harness worked with each driver as a simple IDbConnection. We had the driver connect to the database locally (on Windows), perform some work and commit. We also had it do the same against a remote database server (both on Windows and Linux). We had the test harness perform simulated failures like disabling the network in the middle of a transaction and watching the result. We also had it send shutdown and kill commands to each DB instance at various points in the commit process to verify. We shut down and killed MSDTC at various points and watched the result. We performed distributed transactions against variations of the same database, the same vendor DB on a different machine, different drivers against the same database, or finally a different DB driver and DB vendor altogether. At each stage we created successful scenarios and then variations on failure scenarios from the process dying during the each stage of the transaction, e.g. pre/post "PREPARE" and pre/post "COMMIT".

If it sounds like a lot of tests, it is. We ran each permutation of the test with each driver against each local database and each remote database (Windows and Linux). We ran each outside of a transaction, inside a transaction, and inside a distributed transaction. Then we looked at the results.

The only drivers that consistently performed in a distributed transaction were the official Microsoft drivers running against SQL Server. Big surprise. [Note: We didn't run these tests against Oracle.] While each driver held its own and performed consistently in a single transaction (using both TransactionScope and connection.BeginTransaction()), it was the distributed transactions that gave us trouble.

The official MySQL .NET driver doesn't support distributed transactions. A 3rd party MySQL driver did support distributed transactions, but enlisted itself as a "volatile" resource in the distributed transaction. Further MySQL actually rolls back *after* a PREPARE if the connection is lost. Bad. Bad. Bad.

The various PostgreSQL drivers had other issues. One driver properly enlisted in the DTC but couldn't handle rollbacks of a failed distributed transaction. One cohort would vote "no" and the PostgreSQL driver wouldn't rollback the PREPARED transaction which would create an exclusive lock on the associated row and block SELECTs against the table. All in all, it was a little troubling.

We didn't inspect Firebird which after SQLite is probably the only other open source DB we'd consider. There are several considerations in our choice for a relational storage engine. The first is NHibernate support, the second is community support/size/activity. If we don't have a degree of traction from the thriving community, we won't touch it.

But there is hope. First and foremost, the above issues are only manifest when working against a transactional queue. If the queue isn't transactional, then we don't have this problem. Second, if we absolutely positively required a distributed, two-phase commit, we can use SQL Server.

If we design our services along business boundaries according to Udi's SOA principles, we can have multiple instances of SQL Express running on multiple boxes. Each DB within each Express instance is limited to 4GB of storage. That's actually quite a bit of storage. If we had a single autonomous component exceed 4GB, we could easily fork out the cash for a proper DB license because the AC would have justified its need.

At the same time, what if we were able to avoid 2PC commits or any style of distributed commit altogether? What if we wanted to use an alternative storage engine, like CouchDB, Redis, Cassandra or Riak? All of these troubles lead us down a different path, which is the subject of my next post: "Extending NServiceBus: Avoiding Two-Phase Commits".