Understanding Relational Databases: Denormalization
This section of the site features articles published between 2002 and 2004. They remain here for reference purposes and may contain information that is out of date.
This is a free sample chapter from Beginning PHP 4 Databases published by Wrox Press.
Now that we've invested a valuable chunk of our day in learning about normalization, it's time to introduce the concept of denormalization, which is exactly what it sounds like: decreasing a schema's level of normalization. As we've learned, normalization eliminates the data redundancy within a table, which greatly reduces the risk that data may become inconsistent. Why would one wish to reverse this process?
Normalization usually comes at a cost: speed of retrieval. Before normalization, if we wanted to know a donor's name, the dates of the donations, and the name of the project, it was all right there in one record for us to pick. After normalization, we have to go traversing through three or four tables for the same information. In most cases, the extra work is worth it, considering the benefits of data consistency and reduced storage usage. However, in a few rare cases, the speed of data retrieval is the factor that trumps all others. In large databases with complex schemas, one might sometimes require data from twelve or more tables in a single query, and the application may need to perform this type of query hundreds of times per minute. In such situations, a fully normalized database may be unacceptably slow.
Denormalization should not be done early, however. It is a last desperate resort that one should turn to only after exhausting all other options (like query optimization, improved indexing, and database system tuning, all of which will be discussed later in the book). Normally, follow the simple rule:
When in doubt, normalize.
One alternative to denormalizing the base tables (the tables that make up a database) is to create a separate reporting table so that the base tables are left unaffected. For example, suppose that in our previous example, we very frequently need to retrieve a donor's name, donation ID, and the date of the donation. The query often proves to be too slow in providing results. This may not seem realistic, given that it only involves two tables and any modern RDBMS would handle this with break-neck speed, but just use your imagination. We might be tempted to re-enter the donor's name to our
|5||3||Seung Young Lee||16Dec2002||$150||4|
This is a heart-breaking departure from everything we've worked so hard to achieve. See the redundancy? See the wasted space?
With a separate reporting table, our three base tables of
Project remain beautifully normalized. The schema as a whole is not fully normalized, because the reporting table itself is redundant, but at least all of the redundancy is concentrated and isolated in one table, whose sole job is to provide quick access to data that comes from multiple sources. Thus in the
Donor table we have:
|2||Victor Gomez||United States|
|3||Seung Yong Lee||South Korea|
Project table is as follows:
|1||Solar Scholars||Powering schools with solar panels|
|2||Pear Creek cleanup||Cleaning up litter and pollutants from Pear Creek|
|3||Danube Land Trust||Purchasing and preserving land in the Danube watershed|
|4||Forest Asia||Planting trees in Asia|
Donation table is as follows:
The purpose of the three tables shown above is to properly store the organization's data in a way that is consistent and reliable. They are the core tables on which the organization's applications will be based. In contrast, the
Report_DonorName_Date table below is solely designed to facilitate a specific report without involving the base tables. We have not abided strictly to normalization with this table in order to provide a single location where the most frequently requested data might be accessed quickly.
Report_DonorName_Date table is as follows:
|5||Seung Yong Lee||16Dec2002|
A reporting table is usually used as a data cache - a place to store amalgamated or semi-amalgamated data for fast access, which reduces demand on the main location where the data are stored in a raw state. Depending on the business requirements of the application, it might even be possible to only fill the
Report_DonorName_Date table periodically, say overnight when the system is least busy. Others among us are not quite as lucky, and have to ensure that the data in the reporting table is no older than ten minutes, or even ten seconds. Even then, the reporting table offers a performance advantage. It is better to query the base tables once every ten seconds than hundreds or thousands of times per minute. Triggers, which will be discussed in Chapter 7, can be useful in keeping a reporting table up-to-date.
Denormalization is not pretty, but it is often helpful. If you absolutely must do it, then do it; but make sure you feel guilty about it, just like the professionals. A simple rule for beginners to database design is never denormalize.