Message Idempotency Patterns and Disk Usage

One of the main concerns that may be raised in reviewing my previous blog post about idempotent messaging patterns is disk usage. The concern is about retaining every. single. message we ever publish. That's a lot of messages. That's a lot of disk space.

Before we dig into one plausible solution, let's first look at a few mitigating factors. First, disk space is cheap. Really cheap. We're moving into the sub-$100 range for 2 TB of disk space. By using disk-saving techniques such as efficient (and even custom) serialization and compression, we can easily store more messages than a lot of systems ever dream of sending. By further breaking down our code into business services and business components, we have logically separated our message storage and can thus physically separate it across multiple nodes.

Next, for those who plan to implement, or have already implemented, solutions using Greg Young's flavor of event sourcing, you're already keeping your messages around. The messages that you store in the event log are the same ones being dispatched on commit. When using event sourcing we have no choice but to keep all messages because that is what we use to rebuild application state.

Okay, now that we have looked at reasons why we may want to keep all of our dispatched messages around, let's consider the opposite. How can we get rid of messages as quickly as possible? How can we free up our own resources? How can we avoid our disks accidentally running out of space (and bringing down the server) because our message logs consumed all available space?

In looking over the various idempotent messaging patterns, one thing stands out. The reason for keeping messages around is for failure scenarios. Suppose we are re-handling a message because our process died while handling it, but only after it got far enough to transform some application state. In that situation we want to be able to redispatch the messages that should have been sent had the original set of transactions been fully successful.

Because the idempotency patterns thus far have only been addressing failure conditions, how can we address the situation where there were no failures? If a message was successfully handled and dispatched a set of resultant messages, how can we avoid the needless redispatching, or even the reprocessing of a known duplicate message, without storing every single message we've ever dispatched?

There are two parts to the proposed solution which can be applied to varying degrees depending upon how aggressive you'd like to be with your space-saving constraints. The first is to purge successfully dispatched messages and the second is to purge the record that a message was ever processed in the first place.

Because the message log contains messages that we would redispatch upon handling a duplicate message, we only need to keep those messages around so long as we are not confident that the source message was fully processed. In other words, once we have successfully published the results of processing, we have at-least-once delivery guarantees from our messaging infrastructure. Because of this guarantee we don't need to keep these messages around anymore. They only clutter up our message log and occupy disk space.

When a message has been successfully processed we need to somehow indicate to our application-level idempotency middleware that there is no longer a need to keep messages related to the source message. While we may purge the log, we will want to keep the application-level identifier of the message that was processed. We keep this to know if we should even bother reprocessing the message should we receive it again. Keeping a list of identifiers is extremely cheap as compared to storing the contents of every dispatched message indefinitely.
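To make the split between "purge the payloads" and "keep the identifier" concrete, here is a minimal in-memory sketch. The MessageLog class and its method names are hypothetical and exist only for illustration; a real log would sit on durable storage.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MessageLog:
    """Hypothetical in-memory stand-in for a durable message log."""

    # Source message id -> outgoing messages we would redispatch on a duplicate.
    pending: dict[str, list[bytes]] = field(default_factory=dict)
    # Identifiers of every source message we have handled.
    processed_ids: set[str] = field(default_factory=set)

    def record(self, source_id: str, outgoing: list[bytes]) -> None:
        """Store the resulting messages and remember the source identifier."""
        self.pending[source_id] = list(outgoing)
        self.processed_ids.add(source_id)

    def purge_dispatched(self, source_id: str) -> None:
        """Drop the stored payloads but keep the (cheap) identifier."""
        self.pending.pop(source_id, None)

    def is_duplicate(self, source_id: str) -> bool:
        return source_id in self.processed_ids

    def messages_to_redispatch(self, source_id: str) -> list[bytes]:
        return self.pending.get(source_id, [])
```

After purge_dispatched, a duplicate is still recognized through is_duplicate, but there is nothing left to redispatch, which is the behavior we want once publishing has succeeded.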

Our application can indicate that a purge of the message log should occur in any number of different ways. One possible way might include using a separate application thread that receives notification when all resulting messages have been published. Once this other thread receives notification, it could then purge the contents of the message log for that message while retaining the identifier of the message that was processed.

Instead of using a different thread, we could also utilize the messaging infrastructure itself. We could tack on a "MessagesSuccessfullyDispatched" message that is sent (or maybe even published) to a receiver/subscriber queue to purge the message log. Note that MessagesSuccessfullyDispatched is naturally idempotent because processing that message multiple times leaves the message log in the same state. Because it is naturally idempotent, we need not worry about applying the same patterns to the handling of this special infrastructure message.
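A rough sketch of why such a handler is naturally idempotent follows. Only the message name comes from the text above; the source_message_id field, the handler function, and the dictionary standing in for the message log are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MessagesSuccessfullyDispatched:
    # Hypothetical field: identifier of the source message whose results
    # have all been published.
    source_message_id: str


def handle_dispatched(
    msg: MessagesSuccessfullyDispatched,
    pending: dict[str, list[bytes]],
) -> None:
    """Purge stored payloads for the source message.

    Removing an entry that is already gone is a no-op, so handling the
    same message twice leaves the log in the same state.
    """
    pending.pop(msg.source_message_id, None)
```

Handling the message a second time finds nothing to remove and changes nothing, so no duplicate tracking is needed for this message type.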

Another possibility would be to have a completely separate process polling the message log database to see which messages could be purged because they have been successfully processed.
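As one way this polling pass might look, here is a sketch using Python's built-in sqlite3 module. The table layout (a "processed" table with a dispatched flag and an "outgoing" table of stored payloads) is purely an assumption for illustration; the real schema depends on how the message log is stored.

```python
import sqlite3


def purge_dispatched(conn: sqlite3.Connection) -> int:
    """Delete stored payloads whose source message is fully dispatched.

    Returns the number of payload rows removed. Identifier rows in the
    hypothetical 'processed' table are left in place.
    """
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "DELETE FROM outgoing WHERE source_id IN "
            "(SELECT id FROM processed WHERE dispatched = 1)"
        )
    return cur.rowcount


# Hypothetical schema and sample data, in memory for the example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE processed (id TEXT PRIMARY KEY, dispatched INTEGER NOT NULL);
    CREATE TABLE outgoing (source_id TEXT NOT NULL, body BLOB NOT NULL);
    INSERT INTO processed VALUES ('m1', 1), ('m2', 0);
    INSERT INTO outgoing VALUES ('m1', x'01'), ('m1', x'02'), ('m2', x'03');
    """
)
```

A separate process would call purge_dispatched on whatever interval suits the environment; the payloads for 'm2' stay put until its dispatch is confirmed.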

The last part of the solution is to have some kind of out-of-process cleanup that would occasionally wake up and purge the record that a particular message was ever received in the first place, by removing old message identifiers from the log. This purging could be done on a scheduled basis. The only question then becomes, how long do we wait before purging old message identifiers from the log? This can best be answered by understanding how long duplicate messages may be re-dispatched throughout the system. In many cases a few hours or days may be sufficient, but to be extremely safe, we could set the threshold to be something like a few weeks or perhaps even a month. This could even potentially be tuned on a per-message-type level. It all depends upon the idempotency threshold of each message type along with the space constraints and requirements of the business service upon its operational environment.
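The scheduled cleanup with a per-message-type threshold could be sketched along these lines. The message type name, the retention values, and the shape of the processed mapping are all made up for the example; the right thresholds depend on how long duplicates can circulate in a given system.

```python
from __future__ import annotations

from datetime import datetime, timedelta

# Hypothetical thresholds: a conservative default plus a per-type override.
DEFAULT_RETENTION = timedelta(days=30)
RETENTION_BY_TYPE = {"PriceUpdated": timedelta(hours=6)}


def purge_expired_ids(
    processed: dict[str, tuple[str, datetime]],
    now: datetime,
) -> list[str]:
    """Remove identifiers older than the threshold for their message type.

    'processed' maps message id -> (message type, time it was processed).
    Returns the identifiers that were removed.
    """
    expired = [
        message_id
        for message_id, (message_type, processed_at) in processed.items()
        if now - processed_at > RETENTION_BY_TYPE.get(message_type, DEFAULT_RETENTION)
    ]
    for message_id in expired:
        del processed[message_id]
    return expired
```

Once an identifier is purged, a late duplicate of that message would be treated as new, which is why the threshold should comfortably exceed the longest plausible redelivery window.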