Home > Articles > Understanding Relational Databases: Normalization

Understanding Relational Databases: Normalization

skip to navigation

This section of the site features articles published between 2002 and 2004. They remain here for reference purposes and may contain information that is out of date.

Chapter Index

This is a free sample chapter from Beginning PHP 4 Databases published by Wrox Press.

Normalization

A schema (pl. schemata) is the basic organization of the database - the structure of the tables and the relationships among them. The term normalization refers to a series of steps used to eliminate redundancy and reduce the chances of data inconsistency in a database's schema. There are six forms of normalization described in relational database theory:

In practice, database designers mainly concern themselves with the first three forms. The last three are somewhat atypical in the practical world, but remain important in the realm of academic database theory.

First Normal Form

For a schema to meet the requirements of the First Normal Form (1NF), every record within a table should have the same "shape", which means that every field should contain a single value only, and each row should contain the same fields. Duplicated fields (or groups of fields) should be removed.

In our charity donation example, the repeated donation fields, like Amount1 and Amount2, are a violation of First Normal Form. Some of the records (such as the one with Victor Gomez) use all of the fields, while others (such as Marco Pinelli), use only a few of them. These records clearly do not have the same shape. We will now change the table so that each record represents a single donation, thereby eliminating the duplicated fields:

Donor Date Amount Project Description
Marco Pinelli 13Dec2002 $200 Solar Scholars Powering schools with solar panels
Victor Gomez 15Dec2002 $100 Pear Creek cleanup Cleaning up litter and pollutants from Pear Creek
Victor Gomez 15Dec2002 $100 Danube Land Trust Purchasing and preserving land in the Danube watershed
Victor Gomez 15Dec2002 $50 Forest Asia Planting trees in Asia
Seung Yong Lee 16Dec2002 $150 Forest Asia Planting trees in Asia

This arrangement is a significant improvement over our previous schema, since it offers greater flexibility. No matter how many donations Victor Gomez contributes, our new table can easily accommodate it simply by adding records.

An additional requirement of First Normal Form is that each record be uniquely identifiable. As we learned earlier in the chapter, this is accomplished with a primary key. In our original table, the Donor field could serve as the primary key. There was only one record per donor, and knowing the donor name was all that one needed to locate the specific record for that donor.

Our new table is a little more complicated. The donor name alone cannot serve as a primary key, since there may be multiple records for the same donor. For example, the name Victor Gomez does not uniquely identify any one record. Similarly, none of the values in the Date, Amount, or Project fields is necessarily unique. However, the combination of Donor, Date, and Project is unique. This is known as a composite key, a key that is composed of more than one field, as opposed to a simple key, which uses only one field. While there are many records for Victor Gomez, there is only one record for the same on 15Dec2002 for the Forest Asia project.

Composite primary keys are often the ideal solution for uniquely identifying records in a table. However, in our particular example, the composite key that we have chosen may be disadvantageous too. Although there are no two records in the table with the same combination of values for Donor, Date, and Project , we need to ask ourselves if it is impossible that the same donor could make more than one donation to the same project in a day. Though unlikely, it does seem possible, and we want our database schema to account for all possibilities. In such a situation, even a composite key is inadequate.

The primary keys that we have seen so far are natural or logical keys - keys that are derived from data that already existed in the table, such as the Donor field in our original table. When neither a simple key nor a composite key can be derived from among the fields in the table, it is common practice to create a new field dedicated to the purpose of serving as primary key. Such a key is known as an artificial key, or a surrogate key. Surrogate keys are fields whose values are changed automatically, to a value previously unused in that field with each new record. Let's add a surrogate key to uniquely represent each donation in the table. We'll place it at the beginning of the table, since that is where primary keys are conventionally kept:

DonationID Donor Date Amount Project Description
1 Marco Pinelli 13Dec2002 $200 Solar Scholars Powering schools with solar panels
2 Victor Gomez 15Dec2002 $100 Pear Creek cleanup Cleaning up litter and pollutants from Pear Creek
3 Victor Gomez 15Dec2002 $100 Danube Land Trust Purchasing and preserving land in the Danube watershed
4 Victor Gomez 15Dec2002 $50 Forest Asia Planting trees in Asia
5 Seung Yong Lee 16Dec2002 $150 Forest Asia Planting trees in Asia

It is very common for a surrogate key to have a name that both describes the type of entity the table represents (such as a donation) and indicates that the purpose of the field is to identify the record (such as the ID).

This table now meets the requirements of First Normal Form. To summarize those requirements once more:

Second and Third Normal Forms

Second Normal Form (2NF) and Third Normal Form (3NF) are very similar. Both are primarily concerned with eliminating data redundancy within a table. For a schema to be in Second Normal Form, it must meet the following requirements:

For a schema to be in Third Normal Form, it must meet the following requirements:

The first point of each list is important to note. The steps of normalization are cumulative. A schema cannot conform to Third Normal Form if it does not also conform to Second Normal Form, which in turn means that it must conform to First Normal Form.

The second points of the two lists are similar and may be combined like this:

All non-primary-key fields which are not fully dependent on the primary key should be removed and placed in a separate table.

For example, the Date field is an attribute of the donation itself. It describes something about the donation, and therefore the Date field is considered to be dependent on the donation's primary key, the DonationID. Similarly, the Project field modifies the donation. It describes the project that the donation is for. However, the Description field does not directly modify the donation; it describes the project. It is thus dependent on the project, which is not the primary key of the table. Therefore, the Description field should be removed from the Donation table and placed in a separate table specific to the projects:

Name Description
Solar Scholars Powering schools with solar panels
Pear Creek cleanup Cleaning up litter and pollutants from Pear Creek
Danube Land Trust Purchasing and preserving land in the Danube watershed
Forest Asia Planting trees in Asia

In the Donation table, we now have:

DonationID Donor Date Amount Project
1 Marco Pinelli 13Dec2002 $200 Solar Scholars
2 Victor Gomez 15Dec2002 $100 Pear Creek cleanup
3 Victor Gomez 15Dec2002 $100 Danube Land Trust
4 Victor Gomez 15Dec2002 $50 Forest Asia
5 Seung Yong Lee 16Dec2002 $150 Forest Asia

Not only does the introduction of a new table simplify our original table's design considerably, it also offers a new level of flexibility. We now have a place where we can store additional information about the project, such as the project's director or date of inception. In other words, our schema is no longer two-dimensional. The new table enables us to handle not only details about the donations, but also details about the projects to which donations are made. Now that we have introduced more than one table, it should be noted that the tables' names must be unique within the database.

We need to ensure that our new table also complies with the First Normal Form. It obviously does not have any duplicated fields. Does it have a primary key to uniquely identify each record? The project's name is unique within the table, but again we must ask, is it possible for two projects to have the same name? It is conceivable that the Pear Creek cleanup might be an annual event, and might be regarded as a new project each year, with the same name as the previous year's project. To be safe, it is a good idea to introduce a surrogate key to the Project table to serve as the primary key:

ProjectID Name Description
1 Solar Scholars Powering schools with solar panels
2 Pear Creek cleanup Cleaning up litter and pollutants from Pear Creek
3 Danube Land Trust Purchasing and preserving land in the Danube watershed
4 Forest Asia Planting trees in Asia

> > Foreign Keys