Integers vs. GUIDs and Natural vs. Surrogate Keys

In the database world, there are few arguments that provoke such strong feelings as the proper choice for a database key. Should we choose a natural or surrogate key? If we choose a surrogate key, should it be an integer or a GUID? I have actually done work on all sides of the argument, so I can see it from the various perspectives.

Natural vs. Surrogate

Joe Celko argues that we should be using a natural key--one composed from the attributes of the entity being modeled rather than just assigning an arbitrary value. I agree with him, but don't stop reading yet. We should keep our keys thin (small) rather than creating a massive composite key, whose values would also be necessary in other tables in order to perform a join. How do I reconcile these two arguments? Easy, I do both.

Allow me to elaborate. I actually have the database's primary key be the natural key, but I create a meaningless, arbitrary surrogate key that can be used for joins in related tables, among other things. In this way I get good performance on joins as well as an easy way to select rows, but I enforce the natural key of the thing being modeled to ensure that two aren't accidentally inserted at the same time.

Integers vs GUIDs

To start, can we please say it "gwid" rather than "goo-id"? This aspect of the argument is also heated, but the irony is that the surrogate key camp is fighting amongst itself.

I have used both extensively, but recently there has emerged a clear winner: the GUID.

The main argument between ints vs GUIDs stems from space usage--4 bytes vs. 16 bytes--yes, GUID are 16 bytes, not 36. The argument is that the database is faster when working with 4 byte keys. True, the integer it can be marginally faster, but it has a number of significant disadvantages. (Update: The other compelling argument is related to index fragmentation and page splits. This is easily solved by using a GUID "COMB"—see the comments for more information.)

First of all, databases are smart. When doing key comparison, they don't have to compare all 16 bytes in order to determine inequality among keys, they stop comparing at the first bit that differs. Furthermore, when performing a select+join, rarely do I want to pull an object with thousands or millions of rows associated with it--when the size difference between integers and GUIDs would start to become measurable. In most applications, when you perform a select, it's only retrieving a dozen rows or so--perhaps maybe a few hundred, but much beyond that you're properly returning too much for the user to digest and all of the associated object creation would become the bottleneck instead of the key comparison.

Second, in any mid-size to large system, e.g. eBay, PayPal, etc. you're going to need an 8-byte number, e.g. BigInt, otherwise you'll exhaust all available values in the database. Ever had to quickly update your database schema and application code to handle 8-byte long values? A local company, Authorize.Net, famous for gateway processing, had this problem because they're used signed, 4-byte integers. They had to send out a notice to all there customers of the change. Fortunately I treat all their return values and identifiers values as strings, so we weren't affected. But your application won't ever get this big right? I hope all of my projects become wildly successful and I plan accordingly.

Third, and this is the biggest one for me, in a 100% fully distributed system we want to avoid a single point of contention by creating a "shared nothing" architecture--including the database. By having a GUID rather than an integer, we can have the application code generate the object's persistence identifier and then persist to the database without first talking to the database. This removes additional burden from the database, reduces latency, improves application performance, allows for occasionally connected systems, and removes the database as a shared resource in a distributed system. Note that using MySQL Cluster you can then geo-distribute your system (but you'll still pay for latency). In a geo-distributed system you definitely don't want to have auto-increment integers because you're just asking for problems...big, big problems.

Fourth, a GUID adds a small measure of security by obscurity. About two years ago we ordered a few Dell machines for our office. When I got to the checkout's "thank you" page, I noticed the parameter "&orderId;=12345678" in the query string. Just for kicks I subtracted one from the number to yield 12345677 and then I pressed enter. All of a sudden on my screen I could see someone else's order information--their name, email, shipping and billing address as well what they ordered. I did this a few more times and sure enough, each time I got a lot of customer information. I thought to myself, I could write a script to harvest all of the order information. Then I thought, nah! and continued with my day. This was Dell we're talking about! They didn't have proper authorization checks in their code. One small advantage to GUIDs is the security by obscurity value. Of course you want to code your application properly, but it's just a small, additional layer of protection that it gives you.

Fifth, GUIDs can help ensure that you have written your database joins (via NHibernate) properly. Back when I used to hand-code stored procedures, I would get a join wrong and get some weird results that may or may not have been quickly discovered. By using GUIDs your joins won't work unless you code them properly.

Conclusion

The arguments for on both sides are often compelling. We have simply found that the use of GUID makes life a lot easier and allows us to focus on other things rather than something as silly, yet mission critical as a primary key.

-------------------

UPDATE: While I am still a firm believer in GUIDs, it's interesting to see what Twitter does for their IDs:

http://www.royans.net/arch/distributed-systems-and-unique-ids-snowflake/

https://github.com/twitter/snowflake