Event Sourcing and Snapshots

One part of Event Sourcing that can become problematic is objects with long, complex lifetimes. In most cases an object's lifetime is relatively short--perhaps a dozen events or so. But there are cases in which an object may live for a very, very long time and be used frequently. Greg gives an example in one of his talks of an object that receives thousands of new events per day. Loading such an object can be expensive because you have to load every state transition since the object's inception.

One shortcut around this is the concept of a snapshot. You send the aggregate a snapshot command message of some kind and it produces a snapshot message containing all of its state--along the lines of converting a domain object to a DTO, except that the conversion happens inside the domain object rather than from the outside.
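As a rough sketch of the idea in Python (the `InventoryItem` example and all names here are illustrative, not from Greg's material):

```python
from dataclasses import dataclass

@dataclass
class InventoryItemSnapshot:
    """A plain message carrying the aggregate's complete state."""
    item_id: str
    name: str
    quantity: int
    version: int  # the aggregate version this state corresponds to

class InventoryItem:
    def __init__(self, item_id: str, name: str):
        self.item_id = item_id
        self.name = name
        self.quantity = 0
        self.version = 0

    def take_snapshot(self) -> InventoryItemSnapshot:
        # Produced from *inside* the aggregate, so nothing outside
        # needs access to its private state.
        return InventoryItemSnapshot(self.item_id, self.name,
                                     self.quantity, self.version)
```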

Once we have this snapshot message, we persist it. Then, when loading the object from storage, we load the latest snapshot along with all of the events since (and including) the event at which that snapshot was taken. This allows us to restore the object to a known state and then "replay" the events that have occurred since the last snapshot.
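A sketch of that load path, assuming a store with hypothetical `load_latest_snapshot` and `load_events` methods and hypothetical helpers for restoring and creating aggregates:

```python
def load_aggregate(store, aggregate_id):
    """Rebuild an aggregate from its latest snapshot plus later events."""
    snapshot = store.load_latest_snapshot(aggregate_id)  # None if none taken yet
    if snapshot is not None:
        aggregate = restore_from_snapshot(snapshot)  # version = event_seq - 1
        first = snapshot.event_seq  # replay since (and including) the snapshot event
    else:
        aggregate = create_empty(aggregate_id)
        first = 1
    for event in store.load_events(aggregate_id, starting_at=first):
        aggregate.apply(event)  # each apply increments the in-memory version
    return aggregate
```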

Here's a critical piece that is often overlooked and not really explained in Greg's DevTeach presentation--the aggregate version. The aggregate has some kind of version number indicating how many events have been run against it since it was first created, including the aggregate creation event. At 39:45 in the video he starts explaining the DB schema and briefly mentions the version number, but doesn't go into any detail.

In my proposed event storage schema I have a column called "snapshot_event_seq", which holds the sequence number of the event at which the last snapshot occurred. When an aggregate is initially created, its snapshot_event_seq (or version) is zero. In memory, as messages are produced by the aggregate root, the sequence counter or version increments. This indicates the version of the aggregate as a whole.
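Modeled as a record, the relevant piece of that schema might look like this (only snapshot_event_seq comes from the proposal above; the other fields are my own guesses):

```python
from dataclasses import dataclass

@dataclass
class AggregateRow:
    aggregate_id: str
    # Sequence number of the event at which the last snapshot was taken.
    # Zero for a freshly created aggregate that has never been snapshotted.
    snapshot_event_seq: int = 0
    snapshot_payload: bytes = b""  # serialized snapshot message, if any
```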

The trick is to *not* increment the snapshot_event_seq in your persistence mechanism when events are saved. Only when a snapshot is created is that value updated. The reason is that, when we load an aggregate from storage, we set the "version" of the aggregate equal to the last snapshot_event_seq - 1. Then we replay the event messages that have occurred since (and including) the last snapshot against the aggregate, incrementing the in-memory version with each one. For example:

In-memory aggregate version: 12

Persisted aggregate version (Last snapshot event sequence): 10

Events since (and including) last snapshot: 3

When loading, we set the version to 10 - 1 = 9 and then add 1 for each of the 3 events (9 + 1 + 1 + 1 = 12), which matches the in-memory representation.
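To make the bookkeeping concrete, here is the same arithmetic as a small sketch (version_after_load is my own helper name, not from the schema):

```python
def version_after_load(snapshot_event_seq: int, num_events_replayed: int) -> int:
    # The snapshot's state corresponds to version snapshot_event_seq - 1;
    # each replayed event (including the snapshot event itself) adds one.
    return snapshot_event_seq - 1 + num_events_replayed

# Snapshot at event 10, three events replayed since (and including) it.
assert version_after_load(10, 3) == 12  # matches the in-memory version
```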

This solves a very important issue. If we simply incremented the value in storage each time an event was saved, our aggregate version would get out of sync. Specifically, we'd load up the aggregate, set the version to snapshot_event_seq - 1, and then replay all events since (and including) the last snapshot. For example:

In-memory aggregate version: 12

Persisted aggregate version: 12 (because we're incorrectly incrementing it with each event saved).

Events since (and including) last snapshot: 2

When loading, we set the version to the incorrect value of 12 - 1 = 11 and add one for each of the 2 events (11 + 1 + 1 = 13), which is out of sync with the in-memory version of 12.
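Plugging the broken numbers into the same version_after_load sketch from above makes the drift obvious:

```python
# Persisted value incorrectly bumped to 12 on every save; two events replayed.
assert version_after_load(12, 2) == 13  # in-memory version is 12: out of sync
```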

See the problem? To solve it, we do the following: in memory, we set the aggregate version to the number of messages processed; when persisting, we only update the stored value when a snapshot is taken.
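In code, that rule reduces to two separate persistence paths; the store methods here are hypothetical:

```python
def save_events(store, aggregate):
    """Appending events never touches the persisted snapshot_event_seq."""
    store.append_events(aggregate.item_id, aggregate.uncommitted_events)

def save_snapshot(store, aggregate):
    """Only the snapshot path writes a new snapshot_event_seq."""
    snapshot = aggregate.take_snapshot()
    store.write_snapshot(aggregate.item_id, snapshot)  # updates snapshot_event_seq
```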

One of the reasons I call the persisted value "snapshot_event_seq" rather than "version" is to keep the in-memory and persisted representations of the value from being confused with one another.