Monthly Archives: January 2011

EventStore: Getting All Events Since…

One of the fundamental advantages of event sourcing coupled with CQRS is the ability to quickly and easily build your read models from your event stream.  This includes the ability to build alternate views or projections from your data (which is a *huge* advantage) as well as the ability to scale by creating duplicate, load-balanced read models from your event stream.

To scale your reads using something other than JSON files hosted on a CDN, such as RavenDB, CouchDB, or even a traditional RDBMS, you need the ability to get all events from the event store.  In one of my recent commits to the EventStore project on GitHub, I added the ability to get all commits from storage from a particular point in time forward.

I have received a few questions regarding why the API for the EventStore queries for all events across all “streams” (meaning aggregates) from a certain time, rather than using some kind of index, such as an integer.  The argument on behalf of using integers is that time is a “fuzzy” and imprecise concept whereas a strict, atomically incrementing integer is a much more precise way to query for all events from a point in time.  Within a stream (or aggregate boundary) we can easily enforce atomically increasing indexes using an integer.  However, once we break outside of that consistency boundary things get more challenging.
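To make the "inside the stream" half of that argument concrete, here is a minimal in-memory sketch of a stream that only accepts a commit when the caller's expected revision matches the current one.  The `Stream` and `ConcurrencyError` names are hypothetical and are not the EventStore API; a real storage engine would also have to perform the check and the write atomically.

```python
class ConcurrencyError(Exception):
    """Raised when a writer's view of the stream is stale."""


class Stream:
    """Hypothetical in-memory stream; revisions are 1, 2, 3, ... with no gaps."""

    def __init__(self):
        self._events = []

    @property
    def revision(self):
        # The revision is simply the number of events committed so far.
        return len(self._events)

    def append(self, expected_revision, event):
        # Reject the write if someone else committed since the caller last read.
        if expected_revision != self.revision:
            raise ConcurrencyError(
                f"expected revision {expected_revision}, stream is at {self.revision}"
            )
        self._events.append(event)
        return self.revision
```

Nothing comparable exists across streams in this sketch, which is exactly the gap the time-based query is meant to cover.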

An auto-incrementing integer or “identity” is usually provided by a relational database or another fully consistent storage engine.  The abstraction exposed by the EventStore allows it to function on top of dozens of different kinds of storage engines, perhaps more than any other kind of abstraction.  As a result, we can’t rely upon the persistence mechanism to expose a unique, ever-increasing integer value.  Many NoSQL implementations don’t support it.

For most, the purpose of querying the event store directly for all events is related to the concept of building a read model from scratch.  In this situation there really isn’t any issue with the fuzziness of time.  But suppose, for whatever reason, you needed to query the event store multiple times to get all committed events since your last query to the event store.  Because clocks may not be 100% synchronized across hardware, can we really be sure we’re not missing events if we ask the event store for all events since our last query?

To handle this situation, the client merely needs to query for all events since a few minutes before the last query.  As long as your server clocks are relatively synchronized using NTP, a few minutes is more than enough to cover variance in clocks.  The trade-off is that you are now guaranteed to receive events that you’ve already handled, which means you must de-duplicate and drop them.  There are a few possible solutions here.  The first is to track the commit identifiers: if you’ve seen an identifier previously, the commit is a duplicate and can be discarded.  Another is to track the most recent revision that you’ve seen for each stream/aggregate and discard any commit at or below that revision.
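Both strategies can be sketched in a few lines.  This is an illustration under stated assumptions rather than the EventStore API: the `Commit` shape, the `fetch_from` callable, the `Poller` class, and the five-minute `OVERLAP` are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, Iterable, List, Optional, Set

OVERLAP = timedelta(minutes=5)  # assumed margin for clock drift between servers


@dataclass(frozen=True)
class Commit:
    commit_id: str   # unique identifier of the commit
    stream_id: str   # stream (aggregate) the commit belongs to
    revision: int    # stream revision after this commit
    stamp: datetime  # time recorded by the writer's clock


class Poller:
    """Strategy 1: remember commit identifiers and drop repeats."""

    def __init__(self, fetch_from: Callable[[datetime], Iterable[Commit]]):
        self._fetch_from = fetch_from  # hypothetical "all commits since" query
        self._seen_ids: Set[str] = set()
        self._last_query: Optional[datetime] = None

    def poll(self, now: datetime) -> List[Commit]:
        # Re-read a window before the previous query to cover clock variance.
        if self._last_query is None:
            since = datetime.min
        else:
            since = self._last_query - OVERLAP
        fresh = []
        for commit in self._fetch_from(since):
            if commit.commit_id in self._seen_ids:
                continue  # already handled in an earlier poll
            self._seen_ids.add(commit.commit_id)
            fresh.append(commit)
        self._last_query = now
        return fresh


def drop_by_revision(commits: Iterable[Commit],
                     last_revision: Dict[str, int]) -> List[Commit]:
    """Strategy 2: keep only commits above the highest revision seen per stream.

    Assumes commits for a given stream arrive in revision order.
    """
    fresh = []
    for commit in commits:
        if commit.revision <= last_revision.get(commit.stream_id, 0):
            continue
        last_revision[commit.stream_id] = commit.revision
        fresh.append(commit)
    return fresh
```

The set of seen identifiers in `Poller` grows without bound as written; a longer-lived client would likely prune identifiers older than the overlap window, or use the per-stream revision map, which only needs one integer per stream.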

While it may seem like some extra work for the client, and indeed it is, the benefits are massive because we have a guarantee that we can easily swap our storage infrastructure with only token effort.  A few technologies, such as NHibernate, try to deliver on this same promise, but ultimately come up short.

The bottom line is that our consistency boundaries should always be that of a stream.  Anything inside of that stream is consistent, while anything outside of that stream is not guaranteed to be fully consistent.  By doing this we are more fully able to distribute our system and we can leverage any storage technology we choose.

Data Precision in NoSQL

I’ve got to hand it to the NoSQL teams. In some recent work on my EventStore project, I have seen how, when you give them a value to store and then query the associated document/row/whatever, the NoSQL solutions hand back the *exact same value* every single time.  The RDBMS crowd is another story.

MySQL has an outstanding bug whereby it truncates any DateTime values and forgets the milliseconds on the value.  Great.  Oh, and the bug has been outstanding for *five years*.  Ouch.

Access has some other, similar quirks. I’m not really sure if Access really deserves being mentioned with any other kind of persistence engine.

In contrast, when I give something to a NoSQL store, I can pretty much expect to get the same value back. At the same time, I should point out that in some cases, the default, client-side serialization for various storage engines may affect precision of some values, like dates and times.  Even so, this is a limitation of the client-side serialization rather than the storage engine itself.
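As an illustration of that last point, here is a small sketch of a client-side encoding that stores timestamps as whole milliseconds since the Unix epoch.  The `to_millis` and `from_millis` helpers are hypothetical and do not belong to any particular driver; they only show how precision can be lost before the value ever reaches the storage engine.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_millis(value: datetime) -> int:
    """Encode a timezone-aware datetime as whole milliseconds since the epoch."""
    # Floor division of timedeltas discards anything below one millisecond.
    return (value - EPOCH) // timedelta(milliseconds=1)


def from_millis(millis: int) -> datetime:
    """Decode milliseconds since the epoch back into a UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)


original = datetime(2011, 1, 10, 12, 0, 0, 123456, tzinfo=timezone.utc)
round_tripped = from_millis(to_millis(original))

# The store returned exactly what it was given; the serializer dropped 456 microseconds.
print(original.microsecond, round_tripped.microsecond)
```

A consumer comparing commit timestamps for equality would be affected by this kind of truncation, even though the storage engine behaved perfectly.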