Monthly Archives: January 2009

Integers vs. GUIDs and Natural vs. Surrogate Keys

In the database world, there are few arguments that provoke such strong feelings as the proper choice for a database key.  Should we choose a natural or surrogate key?  If we choose a surrogate key, should it be an integer or a GUID?  I have actually done work on all sides of the argument, so I can see it from the various perspectives.

Natural vs. Surrogate

Joe Celko argues that we should be using a natural key–one composed from the attributes of the entity being modeled rather than just assigning an arbitrary value.  I agree with him, but don’t stop reading yet.  We should keep our keys thin (small) rather than creating a massive composite key, whose values would also be necessary in other tables in order to perform a join.  How do I reconcile these two arguments?  Easy, I do both.

Allow me to elaborate.  I actually have the database’s primary key be the natural key, but I create a meaningless, arbitrary surrogate key that can be used for joins in related tables, among other things.  In this way I get good performance on joins as well as an easy way to select rows, but I enforce the natural key of the thing being modeled to ensure that two aren’t accidentally inserted at the same time.

Integers vs GUIDs

To start, can we please say it "gwid" rather than "goo-id"?  This aspect of the argument is also heated, but the irony is that the surrogate key camp is fighting amongst itself.

I have used both extensively, but recently there has emerged a clear winner: the GUID.

The main argument between ints vs GUIDs stems from space usage–4 bytes vs. 16 bytes–yes, GUID are 16 bytes, not 36.  The argument is that the database is faster when working with 4 byte keys.  True, the integer it can be marginally faster, but it has a number of significant disadvantages.  (Update: The other compelling argument is related to index fragmentation and page splits.  This is easily solved by using a GUID “COMB”—see the comments for more information.)

First of all, databases are smart.  When doing key comparison, they don’t have to compare all 16 bytes in order to determine inequality among keys, they stop comparing at the first bit that differs.  Furthermore, when performing a select+join, rarely do I want to pull an object with thousands or millions of rows associated with it–when the size difference between integers and GUIDs would start to become measurable.  In most applications, when you perform a select, it’s only retrieving a dozen rows or so–perhaps maybe a few hundred, but much beyond that you’re properly returning too much for the user to digest and all of the associated object creation would become the bottleneck instead of the key comparison.

Second, in any mid-size to large system, e.g. eBay, PayPal, etc. you’re going to need an 8-byte number, e.g. BigInt, otherwise you’ll exhaust all available values in the database.  Ever had to quickly update your database schema and application code to handle 8-byte long values?  A local company, Authorize.Net, famous for gateway processing, had this problem because they’re used signed, 4-byte integers.  They had to send out a notice to all there customers of the change.  Fortunately I treat all their return values and identifiers values as strings, so we weren’t affected.  But your application won’t ever get this big right?  I hope all of my projects become wildly successful and I plan accordingly.

Third, and this is the biggest one for me, in a 100% fully distributed system we want to avoid a single point of contention by creating a "shared nothing" architecture–including the database.   By having a GUID rather than an integer, we can have the application code generate the object’s persistence identifier and then persist to the database without first talking to the database.  This removes additional burden from the database, reduces latency, improves application performance, allows for occasionally connected systems, and removes the database as a shared resource in a distributed system.  Note that using MySQL Cluster you can then geo-distribute your system (but you’ll still pay for latency).  In a geo-distributed system you definitely don’t want to have auto-increment integers because you’re just asking for problems…big, big problems.

Fourth, a GUID adds a small measure of security by obscurity.  About two years ago we ordered a few Dell machines for our office.  When I got to the checkout’s "thank you" page, I noticed the parameter "&orderId;=12345678" in the query string.  Just for kicks I subtracted one from the number to yield 12345677 and then I pressed enter.  All of a sudden on my screen I could see someone else’s order information–their name, email, shipping and billing address as well what they ordered.  I did this a few more times and sure enough, each time I got a lot of customer information.  I thought to myself, I could write a script to harvest all of the order information.  Then I thought, nah! and continued with my day.  This was Dell we’re talking about!  They didn’t have proper authorization checks in their code.  One small advantage to GUIDs is the security by obscurity value.  Of course you want to code your application properly, but it’s just a small, additional layer of protection that it gives you.

Fifth, GUIDs can help ensure that you have written your database joins (via NHibernate) properly.  Back when I used to hand-code stored procedures, I would get a join wrong and get some weird results that may or may not have been quickly discovered.  By using GUIDs your joins won’t work unless you code them properly.

Conclusion

The arguments for on both sides are often compelling.  We have simply found that the use of GUID makes life a lot easier and allows us to focus on other things rather than something as silly, yet mission critical as a primary key.

——————-

UPDATE: While I am still a firm believer in GUIDs, it’s interesting to see what Twitter does for their IDs:

http://www.royans.net/arch/distributed-systems-and-unique-ids-snowflake/

https://github.com/twitter/snowflake

DDD: Your Entity Object Is My Value Object

As a quick primer, an entity object is something that has a identity and can change over time–much like a person–whereas a value object is something that is immutable once constructed and can be interchanged with another value object holding the same value.

I have read, from time to time, the laments of some who would like to see a published list of the all of the various kinds objects–and the classification of each: entity or value object.

Here’s the problem: Your entity objects are my value objects.  In other words, what you consider to have an innate and distinguishable identity, doesn’t particularly concern me.  Perhaps a small, yet highly valid example will suffice.

At the grocery store when you, as a consumer, purchase a gallon of milk, do you particularly care which one you purchase?  Granted, you may look at the expiration date and pull the one from the back–but what if the expiration dates and milk fat values are the same?  Mmm….milk fat….ahhh.  In this situation it doesn’t matter.  You arbitrarily pick one because you don’t care.

Let’s flip the coin–or should I say cow–over on this one.  What about the company that produces the milk?  What happens when a bad batch of milk is identified?  The producer cares very much which container of milk is which.  To them, the container most definitely an entity object–it has identity that may be distinguished from the other containers.  Because it’s an entity object and has identity, the producer can easily notified the retailers to pull the affected containers of milk.

So which one is your object?  Entity or value?  Obviously it depends…on your domain.

Holy Wars: C# vs Java, Windows vs Linux

I was re-reading The 7 Habits and I came upon an interesting principle.  In Habit 2, Begin With the End in Mind, Stephen Covey talks about how change can affect us.  Specifically, he mentions how, by having a changeless, principle-centered core, we don’t have to be afraid of change and that we can adapt and grow with change rather than fearing it or labeling/prejudicing ourselves against it.

The thought occurred to me, how much does the lack of this principle permeate the development community?  We see manifestations of it in the petty arguments of C# vs. Java vs. PHP vs. Ruby vs. C++, etc.  We see it in the Windows vs. Mac vs. Linux camps.

When we have a changeless, principle-centered core we see how truly petty these arguments are.  Is there not value and learning to be had from each of the communities?  Are there not intelligent and capable individuals in each arena?  There is no one way [to rule them all] to develop and write code.

With a "my way or the highway" attitude, the ALT.NET community would never have gotten started. Much of the learning and principles found therein have there origins in Java.  Much of the reading that I do related to SOLID development principles have been created by authors prior to–or early in–the existence of the .NET Framework, e.g. Martin Fowler, Eric Evans, Kent Beck, Robert Martin, etc.

If our core values are centered around quality software and creating the best system possible then we aren’t scared to use the appropriate tools for the job.  I’m not scared of using Linux servers running MySQL Cluster or memcached with an ASP.NET frontend.  I don’t have to be afraid to use Rake (Ruby) to run my build process on my JetBrains TeamCity build server.  It’s all about using capable tools for the job.

We shouldn’t be afraid of using Windows either.  It is a very capable operating system running mission-critical applications.  Despite all of the bad press, it’s still extremely solid.

My point is [blue screen]…

UPDATE: 2008-01-16 – After posting this, I found a post on If Programming Languages Were Religions which proves my point.

Fedora 10 x64 and CERC 1.5 6-Channel SATA RAID

Fedora doesn’t play nice with this RAID card.  After installation it wouldn’t recognize the hardware device.  It would just say:

Volume group "VolGroup00" not found

After a bit of Googling, I came upon this post.  The problem occurs because Fedora is loading too quickly and not giving sufficient time for the RAID card to be recognized.  The solution is to boot into recovery mode using the Fedora installation disk.

Once at the command prompt in recovery mode you’ll want to change the mounted root: chroot /mnt/sysroot

Then run:

mkinitrd -f –with=scsi_wait_scan /boot/initrd-2.6.27.5-117.fc10.x86_64.img 2.6.27.5-117.fc10.x86_64

You’ll need to determine which version of Fedora you’re using and change the numbers in the command accordingly.  On a freshly installed FC10 using the original installation media (x64), the above command should work.

HOWTO: Get 4 Extra Hours Every Day…

About two months ago, I was thinking of the goals I had that I wanted to accomplish as well as all of the projects that I wanted to complete.  There just wasn’t enough time.  It occurred to me that there several immediate steps that I could take to gain a few hours every day.

Wake Up Earlier

I love to sleep.  Let me repeat that, I absolutely, positively love to sleep.  I used to sleep till 7:00 AM–a respectable time to get up.  Now I get up at 5:00 AM every day.

If we do the math, I gain an extra two hours every single day over getting up at 7:00 AM.  This means I get an extra 14 hours of free time per week, 2.5 days every month, and 30 days every year!

Let me repeat that: By getting up at 5:00 AM, I get an extra month of time each year!  What would you do with that extra time?

Although you may say that the same argument can be made for staying up later, I find that there is a victory over self by getting up at 5:00 AM.  I also find that things are much quieter at 5:00 AM rather than 11:00 PM.  What’s more, those who stay up later almost always choose to sleep in thus eliminating the extra time they would have otherwise gained.  In other words, getting up consistently at a specific time wins hands down over staying up later. 

Here’s the strategy to waking up: Just stand up and start your day.  If you even consider going back to bed, you’re toast.  That’s because you’ll end up justifying everything in your half-asleep head.  If you  stand up and start some process, such as reading email or whatever, you won’t have any problems.

It’s in the Bag

How much time to you spend going to lunch, waiting in line/for the server, eating, etc.?  30 minutes?  1 hour?  How would you like that time back?  If you pack your own lunch, you can easily save up to one hour of your time per work day.

It’s also a great way to lose weight.  If you pack your lunch properly you’ll be able to plan your diet rather than leaving it up to the whims your stomach when you order a meal.

Stagger Your Routine

Whether we realize it or not, we are all very much like lemmings or sheep.  We just follow the normal routines.  We all leave for work at pretty much the same time; we all go to lunch at the same time; we all come home at the same time.  Why do you think there’s so much traffic and congestion on the freeway?  Why do you think restaurants are so busy?

If you are able, change the time that you do things.  For example, I typically get to work a little bit earlier.  I try to avoid being in traffic on the hour marks–the same works for going home. Because you’re coming in earlier you should have a little bit more leeway to leave a touch earlier.  This technique saves me about 15-30 minutes every day, depending up traffic.

Here’s another tip: When you’re in your vehicle, are you listening to music or the radio?  What about all of that reading you wanted to accomplish?  Why not listen to books on tape/CD?  Often times I’ll listen to them twice so that I can absorb the material a little bit better.

Focus

This last tip has the ability–more so even than waking up–to help you accomplish those things that you need to accomplish.  Unless followed, it has the ability to take away the time you’ve gained by applying the other ideas.  It has to do with focusing on your goals and projects.

The Internet is a wonderful tool and an incredible time waster.   How many times do you find yourself being drawn away to this site or that?  Often times, while my computer is processing something, I find myself going to various sites–Yahoo News, Google News, among others.

Rather than doing this, why not tackle another small project while your computer is busy?  I’ve recently started bringing my laptop to work so that I can have a dedicated computer on the side while my primary workstation is involved in whatever.  It’s nice to have a machine on the side where I can be reading CodeBetter or LosTechies.

Sometimes you’ll find valid things that you’ll want to investigate in the course of your research on the Internet.  Write it down.  Come back to it later–stayed focused on the task at hand.

I typically allocate about 30-45 minutes per day to reading various technical blogs at the beginning of the day.  You might say that this is a waste of time.  You’d be wrong.  It’s because of the amount of reading that I have done that I discovered the ALT.NET movement, SOLID principles, Inversion of Control and Dependency Injection, NHibernate, and various other invaluable development principles.

What I’m saying is that you need to stay focused on the task at hand.  If that task involves research on the web, that’s fine.  It’s when you click on that link going to a page about Bennifer 2.0 or Angelina Jolie’s new baby that you are wasting time.

Conclusion

The above techniques are situationally specific and may or may not apply to you.  The principles, however, are universal–that of mastering self and thinking about the routine in your life.  For more in-depth discussion on these principles I recommend absorbing–not just reading–The 7 Habits.

For me, all of the above has combined to create, on average, about four more hours each day.  In other words, my year is 14 months long compared to my previous routine.  I get an extra two months every single year!  What do I do with all of my extra time?  I write blog posts.