ILMerge Gotcha

Just remember that when two assemblies have each internalized a reference to another assembly using ILMerge, each gets its own “copy” of that assembly’s static variables.  That is to say, MyClass1.SomeStaticInstance is no longer the same object in the two assemblies.  I fought with this one for the better part of an hour.  At first I thought it was some quirk with the [ThreadStatic] attribute I was using.  It wasn’t.

Removing 2PC (Two Phase Commit)

I received the following email today and I thought I’d answer it as a blog post so that all can benefit:

If you remove the 2PC from the system, how do you deal with ensuring that published events are:

* truly published,

* not lost,

* that there is confidence that the interested parties (subscribers) are truly receiving and processing the events in a ‘timely’ manner

* and that there is confidence and a method that the system can gracefully recover from unexpected situations?

This is an area that seems to be glossed over quite a bit in the talks and sample code.

Most of the time, the ‘talks’ say, "Throw it in a durable queue, and then it’s there (easy-peasy)" and sample code uses an in-memory synchronous stream of registered methods calls to have a lightweight bus.

How do dev’s handle this in the ‘real-world’?

Perhaps I can ask you this question – how do you normally implement your event processing logic and ensuring that, for example, read models are updated properly even when, for example, db connections go down without using 2PC?

Here is my answer:

In the EventStore, here’s how I handle publishing without 2PC:

  1. When a batch of events is received, it is durably stored to disk with a “Dispatched” flag of false, meaning the events haven’t been published yet.
  2. The batch of events, which I call a “commit”, is then pushed to a “dispatcher” (like NServiceBus) on a different thread.
  3. The dispatcher publishes all of the events in the commit and commits its own transaction against the queue.
  4. The dispatcher marks the batch of events/commit as dispatched.

If at any point the dispatch fails, the commit is still marked as undispatched, and when the system comes back online it will immediately publish those events.  Yes, this introduces a slight possibility that a message might be published more than once, which is why we need to de-duplicate when handling messages in the read models.
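The four steps can be sketched roughly as follows.  This is a minimal illustration in Python, not the EventStore’s actual code: the names Commit, InMemoryCommitStore and dispatch_pending are made up for the example, and the “store” is an in-memory dictionary standing in for durable storage.  The publisher deliberately fails once to show why a retry can publish an event twice.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    commit_id: str
    events: list
    dispatched: bool = False  # step 1: saved as "not yet published"


class InMemoryCommitStore:
    """Stand-in for durable storage (hypothetical; a real store writes to disk)."""

    def __init__(self):
        self._commits = {}

    def save(self, commit):
        self._commits[commit.commit_id] = commit

    def undispatched(self):
        return [c for c in self._commits.values() if not c.dispatched]

    def mark_dispatched(self, commit_id):
        self._commits[commit_id].dispatched = True


def dispatch_pending(store, publish):
    """Publish every undispatched commit, then flag it as dispatched.

    If publish() raises, the commit keeps dispatched=False and is picked up
    again the next time this function runs (for example, at startup).
    """
    for commit in store.undispatched():
        try:
            for event in commit.events:
                publish(event)  # step 3
        except Exception:
            continue  # leave the flag alone; retry later
        store.mark_dispatched(commit.commit_id)  # step 4


store = InMemoryCommitStore()
store.save(Commit("c1", ["OrderPlaced", "OrderPaid"]))

published = []
calls = {"count": 0}


def flaky_publish(event):
    calls["count"] += 1
    if calls["count"] == 2:  # simulate the queue going away mid-commit
        raise ConnectionError("queue unavailable")
    published.append(event)


dispatch_pending(store, flaky_publish)  # fails partway; commit stays undispatched
dispatch_pending(store, flaky_publish)  # retry publishes the whole commit again
print(published)  # "OrderPlaced" appears twice
```

Note that “OrderPlaced” went out twice, which is exactly the duplicate the subscribers have to tolerate.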

As far as updating a read model without 2PC, here are the general steps:

  1. Receive the event and update your view models accordingly.
  2. As part of the same database transaction, record the unique identifier for the message into some kind of “I’ve handled this message already” table.
  3. Commit the database transaction.

If that message is ever received again, your system will try to insert into the table with all of the message ids, but the transaction will fail because you’ve already handled the message previously.  Simply catch this duplicate insert exception and then drop the message.  Or you could be more proactive and find out if the message id is already in the table before you update anything.  But duplicates should be rare enough that I would recommend the more reactive approach.
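Here is a small sketch of the reactive approach using Python’s built-in sqlite3 module.  The table and column names (order_totals, handled_messages) and the handle function are invented for the illustration; the point is only that the view update and the message id insert share one transaction, so a duplicate id rolls back the view update as well.

```python
import sqlite3


def open_read_model():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE order_totals (order_id TEXT PRIMARY KEY, total INTEGER NOT NULL)"
    )
    db.execute("CREATE TABLE handled_messages (message_id TEXT PRIMARY KEY)")
    return db


def handle(db, message_id, order_id, amount):
    """Apply one event to the view.  Returns False if it was a duplicate."""
    try:
        with db:  # one transaction: commit on success, roll back on error
            # Step 1: update the view model.
            db.execute(
                "INSERT OR IGNORE INTO order_totals (order_id, total) VALUES (?, 0)",
                (order_id,),
            )
            db.execute(
                "UPDATE order_totals SET total = total + ? WHERE order_id = ?",
                (amount, order_id),
            )
            # Step 2: record the message id; the primary key rejects repeats.
            db.execute(
                "INSERT INTO handled_messages (message_id) VALUES (?)",
                (message_id,),
            )
        return True  # step 3: committed
    except sqlite3.IntegrityError:
        return False  # already handled; the view update was rolled back


db = open_read_model()
first = handle(db, "m1", "o1", 50)
again = handle(db, "m1", "o1", 50)  # same message delivered twice
other = handle(db, "m2", "o1", 25)
total = db.execute(
    "SELECT total FROM order_totals WHERE order_id = 'o1'"
).fetchone()[0]
print(first, again, other, total)
```

The second delivery of m1 is dropped and the total ends up at 75, not 125.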

One question that is too often overlooked is how long you keep those identifiers around before removing them.  The answer depends upon the nature of your system and how long the possibility of duplicate messages exists.  You could easily set up an automated/scheduled task to clean out old identifiers that have been in the table for perhaps a few days or even a week or more.
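Such a cleanup task could look something like this, assuming the table of handled message ids carries a timestamp column (called handled_at here, a name chosen for the example) and again using sqlite3 purely for illustration.  The seven-day retention is an arbitrary choice.

```python
import sqlite3


def purge_handled(db, keep_days):
    """Delete message ids older than keep_days; returns how many were removed."""
    with db:
        cursor = db.execute(
            "DELETE FROM handled_messages WHERE handled_at < datetime('now', ?)",
            (f"-{keep_days} days",),
        )
    return cursor.rowcount


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE handled_messages ("
    " message_id TEXT PRIMARY KEY,"
    " handled_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
)
# One id handled ten days ago, one handled just now.
db.execute(
    "INSERT INTO handled_messages (message_id, handled_at) "
    "VALUES ('old', datetime('now', '-10 days'))"
)
db.execute("INSERT INTO handled_messages (message_id) VALUES ('recent')")
db.commit()

removed = purge_handled(db, keep_days=7)
remaining = [row[0] for row in db.execute("SELECT message_id FROM handled_messages")]
print(removed, remaining)
```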

If you’re using a storage engine for your view models that doesn’t support traditional transactions, e.g. a NoSQL store, then you have to be a bit more creative about how you de-duplicate messages, but that’s another topic for another day.

CQRS: Event Sourcing and Immutable Data

There are a number of interesting and unique advantages offered by event sourcing as well as messaging in general.  Some of these advantages include the ability to perform merge-based, business-level concurrency, as compared to simple optimistic or pessimistic concurrency.  Further, the ability to replay all stored messages into new or alternate models, because business context and intent have been captured, is invaluable.  One interesting advantage that may not seem like much at first is the immutability of the events once they are committed to a persistent medium.

This idea is incredibly powerful because it addresses two of the primary challenges in computing: concurrency and the need for a single, authoritative source of truth.  Once an event has been accepted and committed, it becomes an established fact, as unalterable as a decree from Pharaoh, and it can be copied everywhere.  The only way to “undo” an event is to add a compensating event on top, like a negative transaction in accounting.
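To make the accounting analogy concrete, here is a tiny sketch in Python.  The event names (Deposited, DepositReversed) and the balance function are invented for the example; the idea is that nothing in the log is ever edited, and a mistake is corrected by appending a new event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an event cannot be modified after creation
class Event:
    kind: str
    amount: int


def balance(events):
    """Fold the whole log into the current balance."""
    total = 0
    for event in events:
        if event.kind == "Deposited":
            total += event.amount
        elif event.kind == "DepositReversed":
            total -= event.amount
    return total


log = (Event("Deposited", 100), Event("Deposited", 40))

# The second deposit was a mistake.  Instead of deleting it,
# append a compensating event on top of the existing log.
log = log + (Event("DepositReversed", 40),)

print(len(log), balance(log))
```

All three events remain in the log, and the balance comes out at 100.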

So what does that mean for my application?  How can I take advantage of the immutability of events?  Well, for starters, by following the ideas found below you can all but eliminate the chance of the high-profile data loss incidents that companies face, because you can rebuild your views and reports by replaying all recorded events.

One of the biggest and most obvious ways to take advantage of immutable events is to increase your application’s performance by caching like crazy.  Why not replicate the events into some kind of disk-backed or in-memory cache across a bunch of nodes?  Because the events never change, we can always be assured we’re getting the latest version of a particular event regardless of which cache node we talk to.  In this way, we can read from almost anywhere when we populate our read models or create other reports from our events.

Furthermore, backups become ultra simple—just get all events that have been added since the last backup.  Disaster recovery involves reading data from somewhere else.

From time to time the question comes up: won’t this take up a lot of space?  Yes, it might.  What if I want to clear out some events to make room for new ones?  Can we snapshot and delete/archive events from before the snapshot?  The answer is…why?  Disk space is cheap and data is valuable.  Why not replicate those events to another kind of ultra-cheap, yet highly available storage such as S3 or Azure Blobs?  In this way we have the most immediate and recent events available locally, and all of the older events remain available to us with a few simple queries, at a slightly higher cost in terms of latency.  How much does 1 GB of storage cost on S3?  How many business events can you store in 1 GB of space, especially when compressed?

If somehow events could change, all of the above advantages would disappear and we would have to keep going back to a single source of truth—a single point of failure—to be sure we had the latest version.  How much of your data is held hostage inside of a legacy database?

If you have a lot of systems that listen to those events, each system could maintain its own copy, thus decreasing the load on your primary or mission-critical systems.

The last advantage is almost unnoticed in the way it compensates for a nefarious and silent killer—media decay or “bit rot”.  Ever had a hard drive slowly and silently corrupt your data?  When media goes bad, it takes your data with it.  But because our data is immutable, it becomes very easy to detect via checksums that the data has been altered by tampering or by media decay.  In our world, this isn’t a problem because we don’t need a single source of truth.  Much like distributed source control, e.g. git, mercurial, etc., any repository can be the authoritative source if we detect problems in our most readily available copy.  Bye bye silent media corruption.
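As a rough illustration of the checksum idea, the Python sketch below records a SHA-256 digest when an event is written and later verifies each available copy against it, falling back to another replica when the nearest copy has decayed.  The replica names and the read_verified helper are made up for the example.

```python
import hashlib


def digest(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


def read_verified(copies, expected_digest):
    """Return the first (source, payload) whose checksum still matches."""
    for source, payload in copies:
        if digest(payload) == expected_digest:
            return source, payload
    raise ValueError("no intact copy found")


original = b'{"type": "OrderPlaced", "orderId": 42}'
recorded = digest(original)  # stored once, when the event was committed

copies = [
    ("local-disk", original.replace(b"42", b"43")),  # simulated bit rot
    ("offsite-replica", original),
]

source, payload = read_verified(copies, recorded)
print(source)
```

Because the event never legitimately changes, any mismatch means corruption or tampering, and any copy that still matches is as good as the original.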

If we wanted, we could even write our events to a solid state drive (SSD) so that we can accept writes more quickly.  SSDs generally don’t like lots of writes to the same physical sectors and have a tendency to wear out over time as more and more writes occur to the same areas, but our events have a “solid” state, which means that we write them once and then read forevermore.  Thus, the wearing out of an SSD is much less of a problem.

Immutability is a very simple property, but it has profound implications and we can more easily build a truly distributed system.

IIS 7 “500” Errors

I paid my “Windows tithing” recently and did a complete reinstall. Fortunately Windows is now a guest VM inside of a Linux host. A settings change I had made a long time ago but forgot to reapply during my reinstall was for IIS. Whenever I was developing, even locally, I would get “500” errors from IIS which would then display a generic and very unhelpful error page.

The solution is to go into IIS and disable generic error messages:

http://mvolo.com/blogs/serverside/archive/2007/07/26/Troubleshoot-IIS7-errors-like-a-pro.aspx

New NServiceBus Feature: 32-bit (x86) Host Process

NServiceBus is an “Any CPU” framework.  It doesn’t have any 32-bit or 64-bit specific code.  This makes it very easy to transition between 32-bit and 64-bit operating systems.  Unfortunately, not all assemblies are or even can be compiled using the default “Any CPU” architecture. In many, if not most cases, this is related to legacy systems that have 32-bit specific code for platform interop with native C libraries, etc.

If you use the default host (NServiceBus.Host), your application will always load in 64-bit (x64) mode if you’re on a 64-bit OS or in 32-bit (x86) mode for a 32-bit OS.  Again, this is typically not a problem.  But if there are assemblies or other libraries containing 32-bit code that must be invoked and loaded into the process, we’ve got a problem: a BadImageFormatException problem.

I recently pushed a commit to the master branch of NServiceBus on GitHub that compiles two specific versions of the NServiceBus Host.  It compiles the default “Any CPU” version as usual.  But now it also compiles one called NServiceBus.Host32.exe.  This will allow users running a 64-bit OS to run a 32-bit NServiceBus process, thus allowing the execution of 32-bit binaries/code without having to resort to workarounds such as corflags.exe, which instructs the .NET Framework to run in 32-bit mode.

Installing the VirtualBox Extension Pack on Ubuntu 10.10 x64

There have been quite a few posts related to issues installing the VirtualBox Extension Pack for both Windows and Linux hosts.

  • http://forums.virtualbox.org/viewtopic.php?p=11262&sid=334fb962995ae00d32bb8988192f701c
  • http://www.virtualbox.org/ticket/7899
  • http://www.virtualbox.org/ticket/7972
  • http://blogs.oracle.com/wim/2010/12/oracle_vm_virtualbox_40_extens.html

The error message given is very cryptic:

“Failed to install the Extension Pack” NS_ERROR_FAILURE (0x80004005)

Weird.

In digging through the above posts I found tidbits of the solution that I was able to put together. I’m currently running Ubuntu 10.10 x64, so here’s how I solved the problem and installed the extension pack. Give the VBoxExtPackHelperApp execute permissions and then run the install from the command line.

  1. sudo chmod 744 /usr/lib/virtualbox/VBoxExtPackHelperApp
  2. sudo /usr/lib/virtualbox/VBoxManage extpack install Oracle_VM_VirtualBox_Extension_Pack-4.0.4-70112.vbox-extpack

Conference Sessions – Distributed Systems

I will be speaking at two upcoming conferences: the Rocky Mountain Tech Trifecta in Denver, Colorado on March 5th and the Utah Code Camp in Salt Lake City, Utah on March 19th.  I will be speaking on distributed systems and messaging.

I’ll be touching on the following topics:

  1. The Fallacies of Distributed Computing
  2. Messaging
  3. Publish/subscribe
  4. NoSQL
  5. CAP
  6. CQRS
  7. Event sourcing (depending upon time constraints)

If you’re attending the Utah Code Camp, go vote for my session.

EventStore: Getting All Events Since…

One of the fundamental advantages of event sourcing coupled with CQRS is the ability to quickly and easily build your read models from your event stream.  This includes the ability to build alternate views or projections from your data (which is a *huge* advantage) as well as the ability to scale by creating duplicate, load-balanced read models from your event stream.

In order to scale your reads using something other than JSON files hosted on a CDN, such as RavenDB, CouchDB, or even a traditional RDBMS, we need the ability to get all events from the event store.  In one of my recent commits to the EventStore project on GitHub, I added the ability to get all commits from storage from a particular point in time forward.

I have received a few questions regarding why the API for the EventStore queries for all events across all “streams” (meaning aggregates) from a certain time, rather than using some kind of index, such as an integer.  The argument on behalf of using integers is that time is a “fuzzy” and imprecise concept whereas a strict, atomically incrementing integer is a much more precise way to query for all events from a point in time.  Within a stream (or aggregate boundary) we can easily enforce atomically increasing indexes using an integer.  However, once we break outside of that consistency boundary things get more challenging.

The concept of an auto-incrementing integer or “identity” is usually a concept that is provided by a relational database or other fully consistent storage engines.  The abstraction exposed by the EventStore allows it to function on top of dozens of different kinds of storage engines—perhaps more than any other kind of abstraction.  As a result, we can’t rely upon the persistence mechanism to expose a unique, ever-increasing integer value.  Many NoSQL implementations don’t support it.

For most, the purpose of querying the event store directly for all events is related to the concept of building a read model from scratch.  In this situation there really isn’t any issue with the fuzziness of time.  But suppose, for whatever reason, you needed to query the event store multiple times to get all committed events since your last query to the event store.  Because clocks may not be 100% synchronized across hardware, can we really be sure we’re not missing events if we ask the event store for all events since our last query?

To handle this situation, the client merely needs to query for all events since a few minutes before the last query.  As long as your server clocks are relatively synchronized using NTP, a few minutes is more than enough to cover variance in clocks.  At the same time, now you’re also assured to receive events that you’ve already handled previously, which means you must de-duplicate and drop ones that have already been handled.  There are a few possible solutions here.  The first is to track the commit identifiers.  If you’ve seen it previously, it’s a duplicate and can be discarded.  Another is to track the most recent revision that you’ve seen for a particular stream/aggregate and discard those on/below a certain revision.
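The overlap-and-de-duplicate approach might be sketched like this in Python.  Everything here is illustrative rather than the EventStore API: get_commits_since is a hypothetical callable returning (commit_id, stamp, events) tuples, the five-minute overlap is an assumed skew allowance, and the set of seen commit identifiers lives in memory and is never trimmed.

```python
from datetime import datetime, timedelta

OVERLAP = timedelta(minutes=5)  # assumed allowance for clock variance


class CommitPoller:
    def __init__(self, get_commits_since):
        # Hypothetical query: callable(datetime) -> iterable of
        # (commit_id, stamp, events) with stamp >= the given time.
        self._get_commits_since = get_commits_since
        self._last_poll = None
        self._seen = set()  # commit ids already handled

    def poll(self, now):
        """Return commits not seen before, re-reading the overlap window."""
        since = self._last_poll - OVERLAP if self._last_poll else datetime.min
        fresh = []
        for commit_id, stamp, events in self._get_commits_since(since):
            if commit_id in self._seen:
                continue  # duplicate from the overlap window; discard
            self._seen.add(commit_id)
            fresh.append((commit_id, events))
        self._last_poll = now
        return fresh


t0 = datetime(2011, 3, 1, 12, 0, 0)
commits = [("a", t0, ["E1"])]
poller = CommitPoller(lambda since: [c for c in commits if c[1] >= since])

first = poller.poll(t0 + timedelta(minutes=1))

# A commit written by a machine whose clock runs slow: its stamp is
# earlier than our last poll, so a strict "since last poll" would miss it.
commits.append(("b", t0 + timedelta(seconds=30), ["E2"]))

second = poller.poll(t0 + timedelta(minutes=2))
print(first, second)
```

The second poll re-reads commit “a”, discards it as a duplicate, and still picks up commit “b” despite its early timestamp.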

While it may seem like some extra work for the client, and indeed it is, the benefits are massive because we have a guarantee that we can easily swap our storage infrastructure with only token effort.  A few technologies, such as NHibernate, try to deliver on this same promise, but ultimately come up short.

The bottom line is that our consistency boundaries should always be that of a stream.  Anything inside of that stream is consistent, while anything outside of that stream is not guaranteed to be fully consistent.  By doing this we are more fully able to distribute our system and we can leverage any storage technology we choose.