Cassandra/Riak/Dynamo Optimistic Concurrency Control

There are a number of great NoSQL implementation each with their own set of unique advantages and quirks including maturity of the project, size of the community, and technical merit.  While not discounting the unique ways in which each NoSQL project fills a specific set of requirements, I consider one group of NoSQL projects to be a cut above the rest—the Dynamo clones.

While HBase, BigTable, CouchDB, and MongoDB are wonderful and have various levels of acceptance along with installed user base, the Dynamo clones seem to have something else.  They are truly web scale.  I say this because, unlike virtually all other NoSQL implementations, the Dynamo-based projects have no master node.  They are more of a peer-to-peer model with each node acting as a master in its own right.  While CouchDB and some others can act like a cluster of masterless nodes, the Dynamo clones have been architected around this very principle.

Why is this important?  Well, beyond the immediate benefit of no single point of failure and a shared-nothing architecture, the Dynamo clones are design to scale up and down dynamically by simply adding and removing nodes from the cluster.  The “ring” handles everything related to rebalancing or resharding based using mechanisms like gossiping, consistent hashing, hinted handoff, and Merkle trees.  In addition, they each allow tunable or configurable levels of CAP.  It’s not just all CA or AP, but how much of each.

Optimistic Concurrency Problems

While having no master node is one of the strongest and best attributes of Dynamo it is also simultaneously the biggest shortcoming.  The reason is that implementing optimistic concurrency controls against Dynamo is extremely difficult, if not impossible using Dynamo by itself.

Sure, we can get a consistent or quorum read by querying more than half the nodes that store the parituclar data element: “R > N / 2”.  In other words, we can “read our writes”.  The problem is that we can’t reject an update that’s not based upon the latest stored version.  Be design, we can’t prevent multiple writers from writing concurrent and conflicting values.

RDBMS is Dead.  Long Live RDBMS!

When certain elements within our application require strong consistency, we must implement OCC ourselves.  My team kicked this problem around for a bit but we came up with something rather unique and creative.

Why not leverage the locking (and blocking) capabilities of an RDBMS to facilitate OCC within our Dynamo cluster?  If we simply used the RDBMS to implement locking and nothing else, that wouldn’t be very taxing and require lots of memory or disk and should facilitate a significant user load.  Furthermore, the only time any client would experience blocking is when another writer was operating against the same data element within our storage, e.g. the same bank account or same user account or whatever.  In other words, we use the key of the data element as the key for the lock inside our RDBMS.

While in the strictest sense this may feel more like pessimistic locking, let me allay any concerns.  First, we’re only taking a lock on writing.  Every time we attempt to write to our Dynamo cluster, we first acquire an RDBMS lock.  Then we proceed to write using a quorum write (W > N / 2).  Once the write is complete we release the RDBMS lock.  If our process fails or dies the lock is released and other writers working against the same data element are allowed to proceed in serial fashion.

Here is a sample workflow:

  1. Instruction received that requires mutating application state.
  2. Get the key of the data element to be manipulated, e.g. user or account ID.
  3. Open up a DB transaction against our RDBMS using an isolation level of repeatable read.
  4. Insert the key into the RDMBS “locks” table.
  5. Perform a quorum write to the Dynamo cluster.
  6. Rollback the RDBMS transaction to release the lock.

All in all its pretty simple.

Too Much Effort?

The question surrounding this entire implementation is: why even utilize Dynamo at all in this case?  Why not use an RDBMS?  Isn’t this a lot of extra work?

It is true that we are creating some additional overhead and that we are introducing additional latency into the process.  The entire reason behind this approach is that we want to have strict consistency boundaries surrounding a particular kind of element.  While we try to design our application to handle eventual consistency, there are some cases where strict consistency is the easiest and perhaps best approach.

The unique advantage that Dynamo has over a traditional RDBMS, even considering this approach for strict consistency, is scalability.  That’s right, scalability.  While an RDBMS is more rigid in its ability to scale writes that must funnel through a single node (or synchronous cluster), with Dynamo we can easily scale writes by merely adding additional nodes to the ring.  In other words, we can accommodate millions of users in our Dynamo cluster by adding nodes.

One Caveat

Ironically the only difficulty we encounter with this approach other than additional latency is the availability of our lock mechanism itself—the RDBMS.  If client A obtains a lock against DB #1, and DB #1 goes down, we have a chance of client B not being able to obtain a lock or attempting to lock against a different RDBMS, e.g. DB #2.  This means client A is not blocking client B.  Problem.

The easy solution is to have multiple “lock’” servers.  Then, when the application attempts to obtain a lock, it does so against at least N lock servers where N increases based upon your requirement for strict, absolute data consistency.  Each client would attempt to connect to N lock servers in order, e.g. obtain lock against DB #1 and against DB #2.  Only when DB #1 becomes unavailable do we obtain a lock against DB #2 and #3.  Because DB #2 is still available and online, it’s still holding the lock for client A such that client B remains blocked.

Conclusion

This approach may seem like a lot of extra work and overhead but it is actually quite simple while only introducing a small amount of latency into the process.  I have bundled the above principles into a small, open-source project which is call NoSqlMutex.

Maintaining strict consistency is a difficult problem that cannot always be avoided therefore primitives such as the above are necessary for maintaining that consistency against a distributed system.

  • http://www.blogger.com/profile/08936678394177471026 Eric Hauser

    Jonathan,

    Have you looked at using ZooKeeper (http://hadoop.apache.org/zookeeper/docs/r3.3.0/zookeeperOver.html) for this? You would get better latency out of using ZK for a distributed lock as opposed to RDBMS.

    I recently implemented a native ZK client for .NET. It isn't quite production ready yet, but it is close: http://github.com/ewhauser/zookeeper

  • http://basho.com Sean Cribbs

    On Riak, you can implement a sort of optimistic locking using conditional PUT operations on the HTTP interface — however, in general you don't want to. Optimistic locking (or MVCC) by its definition sacrifices write availability. For the applications that need that feature, you'd likely want to bubble up conflicts to the user anyway (or at least to the application).

  • http://www.blogger.com/profile/16836313591238262040 Jonathan Oliver

    Eric,

    Thanks for posting about ZooKeeper. Yes, I'm familiar with it. The basic premise is to facilitate distributed locking using a quorum.

    There are a few reasons why I didn't go the ZK route. One reason is that I didn't want to create a dependency upon yet another system. With RDBMS being a given in virtually all organizations, it's something that's already in place and already well understood.

    The other reason that I'd argue against ZK is actually performance. We're not actually obtaining a true, distributed/quorum lock. While we are using the lock for a distributed system, we only need to obtain a single lock against single, well-known resource (or perhaps two resources to ensure lock safety). Despite obtaining a lock against a shared resource we still have no single point of failure because we simply attempt to obtain each lock against a common, ordered list of lock RDBMS instances. If one instance is unavailable, we try the next. This works the same for all nodes.

    Fortunately, I also view the specific mechanism for locking as a swappable implementation detail. We should be able to swap in/out whatever mechanism—be it RDBMS locking or ZK locking—by adding some infrastructure code rather than application-specific code.

  • http://www.blogger.com/profile/16836313591238262040 Jonathan Oliver

    @Sean,

    I looked closely at each Dynamo implementation, especially Riak, but the issue was that each node was independent and could receive a different write from a different client with a conflicting version and we'd still end up with 2+ versions in the bucket.

    In the end you are correct that we sacrifice write availability. The difference for the architectures I work with is that all commands/instructions from the user are queued in a durable queue. As such we don't actually lose the intention of the user. We just try again if the write fails. If the "world has moved" because someone else made the update first thus making the command from the user irrelevant or conflicting, we can inform them asynchronously.

  • Uwe

    Jonathan,
    did you include Scalaris during your investigation? Although transactional and uses a modified paxos for commits, if you have only single writes to lock, you could commit directly after the write; reads are then ignored anyway. That should be similar to your approach, right?

  • http://www.blogger.com/profile/16836313591238262040 Jonathan Oliver

    I've looked at Scalaris (Scalarix) as well. The one drawback was that it is *always* transactional and falls on the CA side of CAP.

    The draw of the Dynamo clones is that you can choose on per-bucket (or even per-query) basis how much of each CAP property you want. The techniques above are only when you are experiencing concurrent writes and require full consistency where concurrent writes to the same bucket element is undesirable.

  • http://www.blogger.com/profile/08936678394177471026 Eric Hauser

    Jonathon,

    I certainly understand the desire not to introduce a new dependency. However, there are well respected algorithms for distributed concensus (i.e. Paxos). It is almost certain that there will be scenarios in which your lock infrastructure will not achieve the desired concensus that you want.

  • http://www.blogger.com/profile/16836313591238262040 Jonathan Oliver

    @Eric,

    If you look closely at the implementation above you'll notice that we're not actually achieving distributed consensus. We're use local (and blocking) consensus against one or more independent resources in serial fashion.

    You are correct that is has not been as heavily researched and formalized into a protocol like Paxos, but the principle still holds that so long as I have a lock against N resources, nobody else gets through until I release the lock.

    Even so, the mechanism to achieve consensus should be swappable–be it ZK or otherwise such that one can choose based upon a specific need.

    The purpose of this article was merely to outline that an alternative mechanism can be created with existing infrastructure. ZK is a great suggestion should anyone decide to utilize it.

    One parting thought is that these are not meant to be long-running locks. These are short-lived locks that exist only long enough to commit a transaction to our Dynamo cluster using a quorum write.

  • http://bogdanbrinzarea.wordpress.com Bogdan Brinzarea

    Hi Jonathan,

    Have you considered using DHTs instead of RDBMS?

    Bogdan

  • http://www.blogger.com/profile/16836313591238262040 Jonathan Oliver

    @Bodgan,

    That could work too. Is there a particular one that you have in mind?