Wednesday, February 10, 2010

Database Primary keys and Unique indexes

Database Primary Keys

Primary keys play a very important role in designing a relational database schema. Primary keys uniquely identify a record type (a database table) and are often used to create a parent and child relationship between database tables. I am attempting to write whatever I know about it from the following point of views:

  • Identifying primary keys
  • Natural keys vs. Surrogate keys
  • Primary key constraint & Index (Differences in database implementation)
  • Clustered vs. non-clustered primary index
  • GUID as primary key
  • Google BigTable unique key

Identifying primary keys
  • A primary key must uniquely identify a record type and cannot contain null values.
  • One or more columns can be identified as primary key.
  • Usually static column whose value does not change over time is preferable.
  • It can be either be natural keys or surrogate keys.

Natural keys vs. Surrogate keys
  • Natural keys are the unique keys with-in the application domain, such as, identifying a user with his social security number, or a VIN number for the vehicle record type.
  • Surrogate keys are the synthetic keys, usually auto-generated by underlying database system. It can also be generated in external programs. These key values have no meaning in the application domain.
  • Its hard to identify natural keys which can serve as primary key. They are error prone to the mistyping and seldom missing at record creation. Such as wrong SSN , or VIN number missing for the impounded vehicle.
  • Surrogate keys are hard to replicate across multiple databases and need natural data or keys to match and merge the data with other systems.

Primary Key constraint & index (Differences in database implementation)
When primary key constraint is defined on a table then each backend database behaves differently

  • SQL Server always creates a unique index (by default clustered unless NONCLUSTERED keyword is used) on the primary key columns. Even if there is a unique index already defined on the primary key columns. SQL server supports multiple indexes on the same columns, one of which could be clustered and others non-clustered.
  • Oracle & Db2 both use the existing unique index on the primary key columns and if one doesn’t exist then automatically creates one. You can have only one unique index on the same columns.

-- Create a sample table
CREATE TABLE TestTable
(
  UsrID  INTEGER NOT NULL,
  UsrNam  CHAR(50) NOT NULL
);

-- Create a unique index on UsrID
CREATE UNIQUE INDEX TESTIDX ON TESTTABLE (UsrID);

-- Define a primary key constraint on UsrID
ALTER TABLE TestTable ADD CONSTRAINT TESTPK PRIMARY KEY (UsrID);

SQL Server ends up with two indexes on column UsrID, one clustered TESTPK and one non-clustered TESTIDX. Oracle & Db2 just adds the primary constraint TESTPK and uses the TESTIDX unique index

Clustered vs. non-clustered primary index
A clustered index dictates the physical storage of the data in the table. Its analogous to a telephone directory, which arranges data by last name. A nonclustered index is analogous to an index in a textbook. The data is stored in one place, the index in another, with pointers to the storage location of the data. A clustered index is particularly efficient on range based queries while non-clustered indexes are optimal choice for exact match queries.

e.g. Employee table have EmployeeID, FirstName, LastName etc. EmployeeID is the primary key and auto-generated number. SQL Server by default creates clustered index on the primary key. So the query will be very efficient of you search for all the employees whose EmployeeID is between 100 and 200. Usually you search by the names, so having clustered index on LastName, FirstName is more optimal. It is very unfortunate that SQL Server chooses on its own to create a clustered index. I see many SQL server implementation where  clustered indexes are on identity columns.

If you have Order table then having the CustomerID as the left column in the clustered index might be useful as you will be looking up orders of a specific customer, so the customers order can physically reside close together.

GUID as primary key
There are many proponent of GUIDs as primary keys. My leaning is that we should not use it everywhere just for the sake of using it. It should be based on app domain, scope and carefully considering the benefits over integer key.
  • A GUID is a 128-bit integer (16 bytes) that is unique across all databases, computers and networks. It has a very low probability of being duplicated.
  •  GUID can be advantageous in the distributed environment where syncing is needed. Client applications can generate unique IDs in the offline mode and sync it to the central storage when it comes online. There wont be any key conflict between multiple client sync.
  • Cloud storage are increasingly using GUIDs to identify entities, objects & transactions. It avoids round-trip to the storage to get a unique key.
  • More cloud storage are implementing variant of triple stores which needs more global unique keys across data types.
  • The downside is that creating an index, especially clustered, on GUID column creates heavy index fragmentation, larger index size & require more DBA maintenance (index rebuilding).

Google BigTable unique key
  • BigTable is Google's way of doing data store shared across their products and also third-party apps.
  • It’s a sparse, distributed, persistent multi-dimensional sorted map.
  • It has very creative way of creating unique string as the primary key. The map is indexed by
    • row name - arbitrary byte string
    • column key
    • Timestamp
  • Each value in the map is an uninterested array of bytes.
  • Row name could be a reversed url concatenated with other specific things in the page etc.
  • This tells us that we need more and more creative ways to create primary keys (unique keys) to the way we are doing business now.