Developer Justin James provides an overview of the basics of normalization and de-normalization, with an emphasis on the Third Normal Form. via blogs.techrepublic.com.com
One of the central ideas in relational databases is the concept of normalization. Normalization and de-normalization play a significant role in your ability to write applications that perform well and tolerate data changes without creating inconsistencies. In this article, I will talk about the basics of normalization and de-normalization, so that you can get a handle on this topic.
The field of database theory contains a number of normalization levels; however, when the typical developer or DBA says that a database is “normalized,” they are referring to what is called the Third Normal Form or 3NF (there are quite a few “Normal Forms”). This is what I will focus on in this column.
In 3NF, each table has a key column, and the only actual data in the table are things that directly belong to that key’s concept; this is very similar to object-oriented theory. Table A is an example dataset that is not properly 3NF.
Not normalized. (Click the image to enlarge.)
Because Table A is not normalized, we create the following problems for ourselves:
- If a customer record ever changes, we need to make the change in every invoice record in the table, which leaves a lot of room for error.
- If another table needs to refer to customers, it will either need to refer to invoices (which is confusing and does not create a canonical record for each customer) or replicate the customer data elsewhere, creating two sources of customer data.
- Making a change to a customer record involves locking every row with that customer, which can be problematic.
- There’s no way of uniquely identifying any given customer.
To normalize this table, we identify each unique record type; in this example, we have two record types (invoice and customers) within each actual record. Next, we split those records into separate tables and give each table a unique ID. This ID should be a value that is wholly independent from the record because, even if the business logic says that the data cannot change, the business logic could change in the future. Additionally, the ID value (a “primary key”) should be one that the database can easily look for and allow for easy partitioning of data between servers if needed; as such, using an integer value with an autoincrementer is ideal for primary keys. After normalization, we will have two pairs of tables (Table B and Table C).
Normalized invoice table. (Click the image to enlarge.)
Normalized customer table. (Click the image to enlarge.)
As you can see, we now have two tables that only contain the data relevant to the table. As a result, if we have another table that needs to deal with customers, it can refer to the independent customer table, which will eliminate data duplication; also, each customer only needs to be updated in one place. In addition, we can now uniquely identify customers. As a bonus, we save a ton of space in the database.
This all sounds great, right? It sure does. But as we know, there are rarely any free lunches in the world of software development. In this case, the cost of this tasty meal is performance, particularly for reading and aggregate operations. Searching goes quite quickly, thanks to all of those convenient primary keys. But let’s pretend that we want to make a sales report at the end of the month that looks like our pre-normalization table; we’ll need to execute a long-running query (after all, we’re grabbing a month’s worth of data) that will need to perform JOIN operations on all of these tables. In the process, we will need to lock all of those rows and generally jam up the database in the process. There are good reasons why reports get run at night, and this is one of them.
We resolve this problem with a data warehouse. (I am not going to go far into data warehouse theory, which is a deep topic.) A data warehouse is designed for pure read speed and typically looks like a de-normalized table. In fact, the way a data warehouse usually works is to periodically take a “snapshot” of the online, relational database used for the day-to-day work and create a variety of de-normalized tables, each designed to meet a particular reporting need. From there, reports are run against the data in the warehouse. This way, the long running, resource-hungry reports can be run against tables that are optimized for those reports’ needs without tying up the online database. The drawback is significant expense in having extra database servers and storage, and the fact that you are reporting against data that are not real-time accurate.