Idempotency Patterns

In several of my previous posts I discussed how to avoid the overhead of a two-phase commit while still being able to maintain an application-level transactional consistency between two resources, such as a message queue and durable data store, e.g. a database. When considering how to implement idempotency there are a number of factors to consider. Let us examine each of these in turn.

First, is the message inherently idempotent? That is, does it contain and operation that contains a fixed value at a point in time such as a stock quote rather than a transformational instruction such as "add $10 to balance". If the message is naturally idempotent, then much of the following discussion can be ignored because reprocessing the message multiple times results in the same value, e.g. f(x) = f(f(x)).

If the message is not intrinsically idempotent, we will want to take steps to mitigate or resolve the effects of processing the same message multiple times. In other words, we want artificial idempotency. In this case, we need to consider the computational overhead of reprocessing a particular message. Understanding this cost can be important because it helps us determine if we must proactively or reactively handle application-level message idempotency.

As mentioned in my previous post, one requirement for creating idempotency in messages that aren't idempotent is to establish a message store. This message store is keyed off of the application-level identifier of the message being processed. It contains a list of messages that have been dispatched as a result of processing the message being received. Hereafter we shall simply call this a message store.

When the computational overhead of reprocessing a message is high as compared to querying to determine if a message has been processed and the probability of a message being received more than once, we will want to proactively filter messages that have already been processed. Remember that the reasons for receiving the same message twice can be varied and range from unexpected process termination to other applications (including our own) redispatching messages.

When proactively filtering messages, we simply query our message store to determine if the message being received has already been processed. If so, redispatch the messages loaded from the query and discontinue processing the current message. Note that the way we store messages in our data store very much depends upon the transactional characteristics of the data store itself. The specifics are outlined in my previous post.

When we reactively filter a message we know that the computational overhead of processing is very low and/or the probability of a duplicate message is relatively low. In this scenario, we simply reprocess as normal. Then, when we attempt to save any to-be-dispatched-messages resulting from processing to our data store, we catch unique key violations (or similar exceptions) thrown by our data store. If we catch an exception, we abandon our current unit of work and redispatch the messages saved to the data store.

The above scenarios are all fine and good, but what about if we aren't operating against a transactional resource? What if we are performing an operation against a web service or some other RPC-based call using a non-reliable protocol? How can we know if the operation was successful? This requires a little bit more work but still falls within the same general set of patterns outlined herein. Generally we will want to follow the proactive approach discussed above but with a few notable exceptions.

The main theme surrounding this last "pattern" is that we cannot truly wrap all work in a unit and submit it as a batch because we are operating against a non-transactional resource. One possible solution would simply be to query the non-transactional resource to determine the state for a given message. One problem is the latency involved when querying remote resources such as web services over HTTP. Other considerations may include that some HTTP calls incur a fee on a per-call basis making it prohibitively expensive to perform a status-checking query for every single message. Lastly, some resources may not even support a querying operation.

At this point, we probably should disambiguate the meaning of non-transactional resource a little bit more. Although the resource may be transactional in the sense that it did or did not perform an operation, it's effectively all or nothing for only a single call, such as a web service. Furthermore, our application may not even know that the remote invocation succeeded or failed because the response never made it back to our application. Or, our process may have terminated after the "request" packet had been sent across the wire but before any "response" packet was received.

Further, just because we're working with a non-transactional resource, doesn't mean that we can't effectively simulate transactional behavior by recording the application status to our own durable resource.

In the cases where the resource supports queries, we can implement a few techniques to avoid excessive and sometimes expensive calls. The first technique is to record the status of the call at each step to a durable resource. For example, when we are about to call a web service we record that fact. When we receive the results of a web service call, we also log that fact. Finally when we are about to dispatch a message containing the results of the web service call, we log that message to our message store. As we receive a message from our own queue and process it, we must check the log to determine what work, if any, has been done for this particular message. If we have previously logged that we called the web service but no corresponding log entry for the results of the web service call was recorded, we can then incur the additional overhead of querying the resource to see what happened and then taking appropriate action.

Let's look at the above for something like calling a shipping web service from FedEx or UPS or perhaps processing a credit card transaction. If our application has logged that we called the web service but there is no log entry containing the results, we can query the web service and ask what has been done for credit card ABC or order 123. Depending upon the results we can take appropriate action such as authorizing a credit card if the transaction was never received or picking up where the previous attempt failed.

The last "pattern" that merits our consideration is when the resource doesn't support query operations. This is where things get a little bit sticky because we have no way to know if the work was really performed. The canonical example is sending an email by connecting to an SMTP server. Like the previous scenario of calling FedEx or UPS, we can perform virtually all of the same steps. The key difference is that we are unable to query to determine the state of the resource involved. In this case, we may have to accept a small level of inconsistency for a very specific failure condition.

The inconsistency being referred to is the possibility of an email being sent twice (continuing with the SMTP example). If our process terminates after us logging that we are going to call the SMTP server (and perhaps after the SMTP server was called), but before we are able to log that the SMTP server was called, we have absolutely no way to know what work has been done. This being the case we have no choice but to perform the operation again. At the same time, is it really that bad if an end user gets an email twice?

Conclusion

I hope you will find some of these patterns useful when building applications using messaging patterns using guaranteed at-least-once delivery provided by messaging infrastructure.

Granted, these are only brief summaries of some of the various "patterns" for making messages idempotent. The examples are a little bit arbitrary and contrived and we might easily poke holes in them, but the principles still hold.