EventStore: Getting All Events Since…

One of the fundamental advantages of event sourcing coupled with CQRS is the ability to quickly and easily build your read models from your event stream. This includes building alternate views or projections from your data (which is a *huge* advantage) as well as scaling out by creating duplicate, load-balanced read models from the same event stream.

In order to scale your reads using something other than JSON files hosted on a CDN (RavenDB, CouchDB, or even a traditional RDBMS, for example), we need the ability to get all events out of the event store. In one of my recent commits to the EventStore project on GitHub, I added the ability to get all commits from storage from a particular point in time forward.
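To make the idea concrete, here is a minimal sketch of rebuilding a read model by replaying every commit from a point in time forward. The real EventStore is a .NET library; this Python sketch and its names (`get_commits_since`, the `events` collection on a commit, `dispatch`) are illustrative assumptions rather than the actual API.

```python
from datetime import datetime

class ReadModelBuilder:
    """Replays commits into a read model. All names here are hypothetical."""

    def __init__(self, store, dispatch):
        self.store = store        # persistence adapter exposing get_commits_since()
        self.dispatch = dispatch  # callable that applies one event to the read model

    def rebuild(self, since=datetime(1970, 1, 1)):
        # Pull every commit persisted at or after 'since', in order,
        # and replay its events into the read model.
        for commit in self.store.get_commits_since(since):
            for event in commit.events:
                self.dispatch(event)
```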

I have received a few questions about why the EventStore API queries for all events across all "streams" (meaning aggregates) from a certain time, rather than using some kind of index, such as an integer. The argument for using integers is that time is a "fuzzy", imprecise concept, whereas a strict, atomically incrementing integer is a much more precise way to query for all events from a point forward. Within a stream (or aggregate boundary) we can easily enforce atomically increasing indexes using an integer. However, once we break outside of that consistency boundary, things get more challenging.
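Within a single stream, the check is simple enough that almost any storage engine with a conditional write can support it. The sketch below (hypothetical names, not the library's API) shows the optimistic-concurrency idea: a commit succeeds only if the caller's expected revision matches the stream's current revision, which keeps revisions strictly increasing without any global counter.

```python
class ConcurrencyError(Exception):
    pass

class Stream:
    """A single aggregate's stream; revisions increase atomically within it."""

    def __init__(self):
        self.commits = []  # ordered batches of events; index + 1 == revision

    def commit(self, events, expected_revision):
        # Optimistic concurrency: the write only succeeds if nobody else
        # has committed since the caller last read the stream.
        if expected_revision != len(self.commits):
            raise ConcurrencyError("stream was modified concurrently")
        self.commits.append(list(events))
        return len(self.commits)  # the new, strictly increasing revision
```

No equivalent check exists across streams, which is exactly why a store-wide counter can't be assumed.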

An auto-incrementing integer or "identity" is usually a facility provided by a relational database or other fully consistent storage engine. The abstraction exposed by the EventStore allows it to function on top of dozens of different kinds of storage engines (perhaps more than any other kind of abstraction). As a result, we can't rely upon the persistence mechanism to expose a unique, ever-increasing integer value; many NoSQL implementations simply don't support one.

For most users, the purpose of querying the event store directly for all events is to build a read model from scratch. In that situation there really isn't any issue with the fuzziness of time. But suppose, for whatever reason, you needed to query the event store repeatedly, each time asking for all events committed since your last query. Because clocks may not be perfectly synchronized across hardware, can we really be sure we're not missing events?

To handle this situation, the client merely needs to query for all events since a few minutes *before* the last query. As long as your server clocks are kept relatively synchronized using NTP, a few minutes is more than enough to cover the variance between clocks. The trade-off is that you're now assured to receive some events you've already handled, which means you must de-duplicate and drop them. There are a few possible solutions here. The first is to track commit identifiers: if you've seen a commit's identifier before, it's a duplicate and can be discarded. Another is to track the most recent revision you've seen for each stream/aggregate and discard any commit at or below that revision.
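Here is a sketch of that polling loop under the same hypothetical API as above, showing both de-duplication strategies (either one alone is sufficient):

```python
from datetime import datetime, timedelta

OVERLAP = timedelta(minutes=5)  # generous enough to absorb NTP clock drift

class Poller:
    """Incrementally feeds a read model; names are illustrative, not the real API."""

    def __init__(self, store, dispatch):
        self.store = store
        self.dispatch = dispatch
        self.last_query = datetime(1970, 1, 1)  # first poll replays everything
        self.seen_commits = set()    # strategy 1: remember commit identifiers
        self.stream_revisions = {}   # strategy 2: highest revision per stream

    def poll(self):
        # Reach back OVERLAP before the previous query to tolerate clock skew.
        since = self.last_query - OVERLAP
        self.last_query = datetime.utcnow()
        for commit in self.store.get_commits_since(since):
            # Strategy 1: drop commits whose identifier we've already seen.
            if commit.commit_id in self.seen_commits:
                continue
            # Strategy 2: drop commits at or below the last handled revision.
            if commit.stream_revision <= self.stream_revisions.get(commit.stream_id, 0):
                continue
            self.seen_commits.add(commit.commit_id)
            self.stream_revisions[commit.stream_id] = commit.stream_revision
            for event in commit.events:
                self.dispatch(event)
```

In practice the set of seen commit identifiers only needs to cover the overlap window, so it can be pruned rather than kept forever.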

While this may seem like extra work for the client, and indeed it is, the benefit is massive: a guarantee that we can swap out our storage infrastructure with only token effort. A few technologies, such as NHibernate, try to deliver on this same promise, but ultimately come up short.

The bottom line is that our consistency boundary should always be that of a single stream. Anything inside of that stream is consistent, while anything outside of it is not guaranteed to be fully consistent. By accepting this we are more fully able to distribute our system and can leverage any storage technology we choose.