Wednesday, February 10, 2010

Database Primary keys and Unique indexes

Database Primary Keys

Primary keys play a very important role in designing a relational database schema. A primary key uniquely identifies each record in a record type (a database table) and is often used to create parent-child relationships between tables. I am attempting to write down what I know about them from the following points of view:

  • Identifying primary keys
  • Natural keys vs. Surrogate keys
  • Primary key constraint & Index (Differences in database implementation)
  • Clustered vs. non-clustered primary index
  • GUID as primary key
  • Google BigTable unique key

Identifying primary keys
  • A primary key must uniquely identify each record in a table and cannot contain null values.
  • One or more columns can be designated as the primary key (see the sketch after this list).
  • A static column whose value does not change over time is usually preferable.
  • It can be either a natural key or a surrogate key.
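
A minimal sketch of a single-column and a composite (multi-column) primary key; the table and column names are illustrative only:

-- Single-column primary key
CREATE TABLE Customer
(
  CustomerID INTEGER NOT NULL,
  CustName   CHAR(50) NOT NULL,
  CONSTRAINT CustomerPK PRIMARY KEY (CustomerID)
);

-- Composite primary key spanning two columns
CREATE TABLE OrderLine
(
  OrderID INTEGER NOT NULL,
  LineNo  INTEGER NOT NULL,
  Qty     INTEGER NOT NULL,
  CONSTRAINT OrderLinePK PRIMARY KEY (OrderID, LineNo)
);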

Natural keys vs. Surrogate keys
  • Natural keys are keys that are already unique within the application domain, such as identifying a user by his social security number, or a vehicle record by its VIN.
  • Surrogate keys are synthetic keys, usually auto-generated by the underlying database system (they can also be generated by external programs). These key values have no meaning in the application domain.
  • It is hard to identify natural keys that can serve as a primary key. They are prone to mistyping and are sometimes missing at record creation, such as a wrong SSN, or a missing VIN for an impounded vehicle.
  • Surrogate keys are hard to replicate across multiple databases and need natural data or keys to match and merge the data with other systems (see the sketch after this list).
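
A minimal sketch contrasting the two; the table and column names are illustrative, and SQL Server's IDENTITY syntax is assumed for the auto-generated surrogate key:

-- Natural key: the VIN already identifies a vehicle in the application domain
CREATE TABLE Vehicle
(
  VIN  CHAR(17) NOT NULL,
  Make CHAR(50) NOT NULL,
  CONSTRAINT VehiclePK PRIMARY KEY (VIN)
);

-- Surrogate key: a synthetic, auto-generated value with no domain meaning;
-- natural data (SSN) is still kept to match and merge with other systems
CREATE TABLE AppUser
(
  UsrID  INTEGER IDENTITY(1,1) NOT NULL,
  SSN    CHAR(11) NULL,
  UsrNam CHAR(50) NOT NULL,
  CONSTRAINT AppUserPK PRIMARY KEY (UsrID)
);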

Primary Key constraint & index (Differences in database implementation)
When a primary key constraint is defined on a table, each backend database behaves differently:

  • SQL Server always creates a unique index (clustered by default, unless the NONCLUSTERED keyword is used) on the primary key columns, even if a unique index is already defined on those columns. SQL Server supports multiple indexes on the same columns, one of which can be clustered and the others non-clustered.
  • Oracle & DB2 both use an existing unique index on the primary key columns, and if one doesn't exist, the database automatically creates one. You can have only one unique index on the same columns.

-- Create a sample table
CREATE TABLE TestTable
(
  UsrID  INTEGER NOT NULL,
  UsrNam  CHAR(50) NOT NULL
);

-- Create a unique index on UsrID
CREATE UNIQUE INDEX TESTIDX ON TESTTABLE (UsrID);

-- Define a primary key constraint on UsrID
ALTER TABLE TestTable ADD CONSTRAINT TESTPK PRIMARY KEY (UsrID);

SQL Server ends up with two indexes on column UsrID: one clustered (TESTPK) and one non-clustered (TESTIDX). Oracle & DB2 just add the primary key constraint TESTPK and use the existing TESTIDX unique index.
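
A sketch of one way to verify this on SQL Server, assuming the TestTable example above; the sys.indexes catalog view lists every index created on the table:

-- List the indexes SQL Server created on TestTable
SELECT name, type_desc, is_unique
FROM sys.indexes
WHERE object_id = OBJECT_ID('TestTable');

-- On SQL Server the separate CREATE UNIQUE INDEX above is redundant; defining only
-- the primary key constraint yields a single unique (by default clustered) index.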

Clustered vs. non-clustered primary index
A clustered index dictates the physical storage order of the data in the table. It's analogous to a telephone directory, which arranges data by last name. A non-clustered index is analogous to an index in a textbook: the data is stored in one place, the index in another, with pointers to the storage location of the data. A clustered index is particularly efficient for range-based queries, while non-clustered indexes are the optimal choice for exact-match queries.

e.g. an Employee table has EmployeeID, FirstName, LastName, etc. EmployeeID is the primary key and an auto-generated number. SQL Server by default creates the clustered index on the primary key, so a query will be very efficient if you search for all the employees whose EmployeeID is between 100 and 200. Usually, though, you search by the names, so having the clustered index on LastName, FirstName is more optimal. It is very unfortunate that SQL Server chooses on its own to create a clustered index; I see many SQL Server implementations where clustered indexes are on identity columns.

If you have an Order table, then having CustomerID as the leading column of the clustered index might be useful, as you will be looking up the orders of a specific customer, so a customer's orders can physically reside close together. A sketch of the Employee example follows.
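
A minimal sketch of the Employee example above, in SQL Server syntax; the constraint and index names are illustrative. The primary key is declared NONCLUSTERED so that the clustered index can be placed on the name columns instead:

CREATE TABLE Employee
(
  EmployeeID INTEGER IDENTITY(1,1) NOT NULL,
  FirstName  CHAR(50) NOT NULL,
  LastName   CHAR(50) NOT NULL,
  CONSTRAINT EmployeePK PRIMARY KEY NONCLUSTERED (EmployeeID)
);

-- Cluster the table on the columns most queries actually search and sort by
CREATE CLUSTERED INDEX EmployeeNameIDX ON Employee (LastName, FirstName);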

GUID as primary key
There are many proponents of GUIDs as primary keys. My leaning is that we should not use them everywhere just for the sake of using them; the decision should be based on the application domain and scope, and on carefully weighing the benefits over an integer key.
  • A GUID is a 128-bit integer (16 bytes) that is unique across all databases, computers and networks. It has a very low probability of being duplicated.
  • GUIDs can be advantageous in a distributed environment where syncing is needed. Client applications can generate unique IDs in offline mode and sync them to central storage when they come online. There won't be any key conflicts between multiple client syncs.
  • Cloud storage systems are increasingly using GUIDs to identify entities, objects & transactions. This avoids a round trip to the storage to get a unique key.
  • More cloud storage systems are implementing variants of triple stores, which need globally unique keys across data types.
  • The downside is that creating an index, especially a clustered one, on a GUID column causes heavy index fragmentation, a larger index size & more DBA maintenance (index rebuilding). A sketch follows this list.
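
A minimal sketch of a GUID key in SQL Server; the table and constraint names are illustrative. NEWSEQUENTIALID() generates ordered GUIDs, which reduces the fragmentation mentioned above compared to NEWID(), and keeping the primary key non-clustered avoids clustering the table on the GUID altogether:

CREATE TABLE SyncedOrder
(
  OrderID   UNIQUEIDENTIFIER NOT NULL
            CONSTRAINT SyncedOrderIDDefault DEFAULT NEWSEQUENTIALID(),
  OrderDate DATETIME NOT NULL,
  CONSTRAINT SyncedOrderPK PRIMARY KEY NONCLUSTERED (OrderID)
);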

Google BigTable unique key
  • BigTable is Google's way of doing a data store shared across their products and also third-party apps.
  • It’s a sparse, distributed, persistent multi-dimensional sorted map.
  • It has a very creative way of creating a unique string as the primary key. The map is indexed by
    • row name - an arbitrary byte string
    • column key
    • timestamp
  • Each value in the map is an uninterpreted array of bytes.
  • A row name could be a reversed URL concatenated with other page-specific identifiers, etc.
  • This tells us that we need more and more creative ways to create primary (unique) keys for the way we are doing business now.

5 comments:

  1. Nice article! My only comment is that I think clustered indices on SQL Server tables are almost always good practice, which is why SS will create one from the primary key if you do not specify a clustered index yourself. If a table does not have a clustered index, then it is simply a heap table. If the data in a heap table becomes fragmented, the only way to defragment it is to copy the data into a new table, truncate the original one, and copy the data back in. If there is a clustered index on the table, you can just rebuild or reorganize the clustered index (Run ALTER INDEX REORGANIZE or ALTER INDEX REBUILD if you have SS 2005 Enterprise Edition), and your table is defragged.

  2. A clustered index stores the actual data rows at the leaf level, so when a row is added to the clustered index the data goes into a specific leaf page. If that page is full, a new page is allocated and some of the rows are moved from the full page to the new one. This causes page splits and leads to fragmentation, and hence the index often requires rebuilding. However, when a row is added to a heap table, no specific page is required and the new row is added to whichever page has enough space. There are people who believe the benefits outweigh these concerns, since there are other advantages to having a clustered index on every table; one huge one is the one you pointed out, that DBAs have tools at their disposal to manage a clustered index more effectively than a heap table. My point in this blog was that the primary key is not always the most optimal choice for the clustered index. The clustered index can be a tool for organizing the physical table, while data access is done through non-clustered indexes.

  3. A nice article on clustered vs. heap
    http://msdn.microsoft.com/en-us/library/cc917672.aspx#EBAA

  4. An interesting point you made, but in general, in the SQL Server world, a clustered table is more efficient than a heap in a number of ways, since it reflects the physical layout of the table. I agree that the primary key should not necessarily be the clustered index, but one should be there on the column(s) that order the table in the most logical manner that fits its use in the application(s). If you pick the correct columns, then the index can be used not only for ordering the table, but for efficient data access as well. Interestingly, in the article you listed, MS's final recommendation was that tables should have a clustered index, not only for performance reasons, but also because clustered tables are much more efficient in storage and deletion of rows than heap tables. This is actually a good article on clustered indexes as well: http://www.sqlservercentral.com/articles/Performance+Tuning/clusterthatindex/952/

  5. hi mr, can you answer this question:
    Scenario: Consider the following scenario modelling project management activities in an organization.
    GreatHome is a real estate firm that lists property for sale. It has a number of sales offices in several cities (eg. Muscat, Sohar, Nizwa, etc). Data held on sales office are office number, which is unique for every sales office and location. Each sales office is assigned one or more agents. Agent id, name, hire date and position are the data held for each agent. An agent must be assigned to only one sales office. For each sales office, there is always one agent assigned to manage that office. An agent may manage only the sales office to which he/she is assigned. GreatHome lists property for sale. Information about property includes property id and location, which is made up of way number, city and postal code. Each unit of property must be listed with one and only one of the sales offices. A sales office may have
    any number of properties listed, or may have no properties listed. Each unit of property has one or more owners. Data held on owners are owner id and owner name. An owner may own one or more units of property. Since a property has one or more owners, GreatHome wants to record the percentage owned by each owner for a particular property.
    send to my email great.111@hotmail.com
