Database Denormalization: Understanding Relational Databases w/ Example

database

Chapter Index

  1. Introduction
  2. Normalization
  3. Foreign Keys
  4. Denormalization
  5. Referential Integrity
  6. Summary

This is a free sample chapter from Beginning PHP 4 Databases published by Wrox Press.

Denormalization

Now that we’ve invested a valuable chunk of our day in learning about normalization, it’s time to introduce the concept of database denormalization, which is exactly what it sounds like: decreasing a schema’s level of normalization. As we’ve learned, normalization eliminates the data redundancy within a table, which greatly reduces the risk that data may become inconsistent. Why would one wish to reverse this process?

Normalization usually comes at a cost: speed of retrieval. Before normalization, if we wanted to know a donor’s name, the dates of the donations, and the name of the project, it was all right there in one record for us to pick. After normalization, we have to go traversing through three or four tables for the same information. In most cases, the extra work is worth it, considering the benefits of data consistency and reduced storage usage. However, in a few rare cases, the speed of data retrieval is the factor that trumps all others. In large databases with complex schemas, one might sometimes require data from twelve or more tables in a single query, and the application may need to perform this type of query hundreds of times per minute. In such situations, a fully normalized database may be unacceptably slow.

Denormalization should not be done early, however. It is a last desperate resort that one should turn to only after exhausting all other options (like query optimization, improved indexing, and database system tuning, all of which will be discussed later in the book). Normally, follow the simple rule:

When in doubt, normalize.

One alternative to denormalizing the base tables (the tables that make up a database) is to create a separate reporting table so that the base tables are left unaffected. For example, suppose that in our previous example, we very frequently need to retrieve a donor’s name, donation ID, and the date of the donation. The query often proves to be too slow in providing results. This may not seem realistic, given that it only involves two tables and any modern RDBMS would handle this with break-neck speed, but just use your imagination. We might be tempted to re-enter the donor’s name to our Donation table:

DonationIDDonorIDDonorDateAmountProjectID
11Marco Pinelli13Dec2002$2001
22Victor Gomez15Dec2002$1002
32Victor Gomez15Dec2002$1003
42Victor Gomez15Dec2002$504
53Seung Young Lee16Dec2002$1504

This is a heart-breaking departure from everything we’ve worked so hard to achieve. See the redundancy? See the wasted space?

With a separate reporting table, our three base tables of DonationDonor, and Project remain beautifully normalized. The schema as a whole is not fully normalized, because the reporting table itself is redundant, but at least all of the redundancy is concentrated and isolated in one table, whose sole job is to provide quick access to data that comes from multiple sources. Thus in the Donor table we have:

DonorIDNameCountry
1Marco PinelliItaly
2Victor GomezUnited States
3Seung Yong LeeSouth Korea

The Project table is as follows:

ProjectIDNameDescription
1Solar ScholarsPowering schools with solar panels
2Pear Creek cleanupCleaning up litter and pollutants from Pear Creek
3Danube Land TrustPurchasing and preserving land in the Danube watershed
4Forest AsiaPlanting trees in Asia

The Donation table is as follows:

DonationIDDonorIDDateAmountProjectID
1113Dec2002$2001
2215Dec2002$1002
3215Dec2002$1003
4215Dec2002$504
5316Dec2002$1504

The purpose of the three tables shown above is to properly store the organization’s data in a way that is consistent and reliable. They are the core tables on which the organization’s applications will be based. In contrast, the Report_DonorName_Date table below is solely designed to facilitate a specific report without involving the base tables. We have not abided strictly to normalization with this table in order to provide a single location where the most frequently requested data might be accessed quickly.

The Report_DonorName_Date table is as follows:

DonationIDDonorNameDate
1Marco Pinelli13Dec2002
2Victor Gomez15Dec2002
3Victor Gomez15Dec2002
4Victor Gomez15Dec2002
5Seung Yong Lee16Dec2002

A reporting table is usually used as a data cache – a place to store amalgamated or semi-amalgamated data for fast access, which reduces demand on the main location where the data are stored in a raw state. Depending on the business requirements of the application, it might even be possible to only fill the Report_DonorName_Date table periodically, say overnight when the system is least busy. Others among us are not quite as lucky, and have to ensure that the data in the reporting table is no older than ten minutes, or even ten seconds. Even then, the reporting table offers a performance advantage. It is better to query the base tables once every ten seconds than hundreds or thousands of times per minute. Triggers, which will be discussed in Chapter 7, can be useful in keeping a reporting table up-to-date.

Denormalization is not pretty, but it is often helpful. If you absolutely must do it, then do it; but make sure you feel guilty about it, just like the professionals. A simple rule for beginners to database design is never denormalize.

> > Referential Integrity

Database Denormalization Related Questions

  • What is a denormalized database?
  • What is the purpose of denormalization?
  • What is denormalization in DBMS with examples?
  • What is the difference between normalization and denormalization?

Also Read

Get The Latest Updates

Subscribe To Our Weekly Newsletter

No spam, notifications only about new products, updates.