Understanding Relational Databases: Normalization
This section of the site features articles published between 2002 and 2004. They remain here for reference purposes and may contain information that is out of date.
This is a free sample chapter from Beginning PHP 4 Databases published by Wrox Press.
A schema (pl. schemata) is the basic organization of the database - the structure of the tables and the relationships among them. The term normalization refers to a series of steps used to eliminate redundancy and reduce the chances of data inconsistency in a database's schema. There are six forms of normalization described in relational database theory:
- First Normal Form
- Second Normal Form
- Third Normal Form
- Boyce Codd Normal Form
- Fourth Normal Form
- Fifth Normal Form
In practice, database designers mainly concern themselves with the first three forms. The last three are somewhat atypical in the practical world, but remain important in the realm of academic database theory.
First Normal Form
For a schema to meet the requirements of the First Normal Form (1NF), every record within a table should have the same "shape", which means that every field should contain a single value only, and each row should contain the same fields. Duplicated fields (or groups of fields) should be removed.
In our charity donation example, the repeated donation fields, like
Amount2, are a violation of First Normal Form. Some of the records (such as the one with
Gomez) use all of the fields, while others (such as
Pinelli), use only a few of them. These records clearly do not have the same shape. We will now change the table so that each record represents a single donation, thereby eliminating the duplicated fields:
|Marco Pinelli||13Dec2002||$200||Solar Scholars||Powering schools with solar panels|
|Victor Gomez||15Dec2002||$100||Pear Creek cleanup||Cleaning up litter and pollutants from Pear Creek|
|Victor Gomez||15Dec2002||$100||Danube Land Trust||Purchasing and preserving land in the Danube watershed|
|Victor Gomez||15Dec2002||$50||Forest Asia||Planting trees in Asia|
|Seung Yong Lee||16Dec2002||$150||Forest Asia||Planting trees in Asia|
This arrangement is a significant improvement over our previous schema, since it offers greater flexibility. No matter how many donations
Victor Gomez contributes, our new table can easily accommodate it simply by adding records.
An additional requirement of First Normal Form is that each record be uniquely identifiable. As we learned earlier in the chapter, this is accomplished with a primary key. In our original table, the
Donor field could serve as the primary key. There was only one record per donor, and knowing the donor name was all that one needed to locate the specific record for that donor.
Our new table is a little more complicated. The donor name alone cannot serve as a primary key, since there may be multiple records for the same donor. For example, the name
Victor Gomez does not uniquely identify any one record. Similarly, none of the values in the
Project fields is necessarily unique. However, the combination of
Project is unique. This is known as a composite key, a key that is composed of more than one field, as opposed to a simple key, which uses only one field. While there are many records for
Victor Gomez, there is only one record for the same on
15Dec2002 for the
Forest Asia project.
Composite primary keys are often the ideal solution for uniquely identifying records in a table. However, in our particular example, the composite key that we have chosen may be disadvantageous too. Although there are no two records in the table with the same combination of values for
Date, and Project , we need to ask ourselves if it is impossible that the same donor could make more than one donation to the same project in a day. Though unlikely, it does seem possible, and we want our database schema to account for all possibilities. In such a situation, even a composite key is inadequate.
The primary keys that we have seen so far are natural or logical keys - keys that are derived from data that already existed in the table, such as the
Donor field in our original table. When neither a simple key nor a composite key can be derived from among the fields in the table, it is common practice to create a new field dedicated to the purpose of serving as primary key. Such a key is known as an artificial key, or a surrogate key. Surrogate keys are fields whose values are changed automatically, to a value previously unused in that field with each new record. Let's add a surrogate key to uniquely represent each donation in the table. We'll place it at the beginning of the table, since that is where primary keys are conventionally kept:
|1||Marco Pinelli||13Dec2002||$200||Solar Scholars||Powering schools with solar panels|
|2||Victor Gomez||15Dec2002||$100||Pear Creek cleanup||Cleaning up litter and pollutants from Pear Creek|
|3||Victor Gomez||15Dec2002||$100||Danube Land Trust||Purchasing and preserving land in the Danube watershed|
|4||Victor Gomez||15Dec2002||$50||Forest Asia||Planting trees in Asia|
|5||Seung Yong Lee||16Dec2002||$150||Forest Asia||Planting trees in Asia|
It is very common for a surrogate key to have a name that both describes the type of entity the table represents (such as a donation) and indicates that the purpose of the field is to identify the record (such as the ID).
This table now meets the requirements of First Normal Form. To summarize those requirements once more:
- All records in a table should have the same shape, or number of fields. Repetitive fields or groups of fields should be eliminated.
- Each record should be uniquely identifiable within the table.
Second and Third Normal Forms
Second Normal Form (2NF) and Third Normal Form (3NF) are very similar. Both are primarily concerned with eliminating data redundancy within a table. For a schema to be in Second Normal Form, it must meet the following requirements:
- The schema must meet all requirements of First Normal Form
- All non-primary-key fields which are dependent on part but not all of a primary key should be removed and placed in a separate table.
For a schema to be in Third Normal Form, it must meet the following requirements:
- The schema must meet all requirements of Second Normal Form
- All fields which are dependent on a non-primary-key field should be removed and placed in a separate table.
The first point of each list is important to note. The steps of normalization are cumulative. A schema cannot conform to Third Normal Form if it does not also conform to Second Normal Form, which in turn means that it must conform to First Normal Form.
The second points of the two lists are similar and may be combined like this:
All non-primary-key fields which are not fully dependent on the primary key should be removed and placed in a separate table.
For example, the
Date field is an attribute of the donation itself. It describes something about the donation, and therefore the
Date field is considered to be dependent on the donation's primary key, the
DonationID. Similarly, the
Project field modifies the donation. It describes the project that the donation is for. However, the
Description field does not directly modify the donation; it describes the project. It is thus dependent on the project, which is not the primary key of the table. Therefore, the
Description field should be removed from the
Donation table and placed in a separate table specific to the projects:
|Solar Scholars||Powering schools with solar panels|
|Pear Creek cleanup||Cleaning up litter and pollutants from Pear Creek|
|Danube Land Trust||Purchasing and preserving land in the Danube watershed|
|Forest Asia||Planting trees in Asia|
Donation table, we now have:
|1||Marco Pinelli||13Dec2002||$200||Solar Scholars|
|2||Victor Gomez||15Dec2002||$100||Pear Creek cleanup|
|3||Victor Gomez||15Dec2002||$100||Danube Land Trust|
|4||Victor Gomez||15Dec2002||$50||Forest Asia|
|5||Seung Yong Lee||16Dec2002||$150||Forest Asia|
Not only does the introduction of a new table simplify our original table's design considerably, it also offers a new level of flexibility. We now have a place where we can store additional information about the project, such as the project's director or date of inception. In other words, our schema is no longer two-dimensional. The new table enables us to handle not only details about the donations, but also details about the projects to which donations are made. Now that we have introduced more than one table, it should be noted that the tables' names must be unique within the database.
We need to ensure that our new table also complies with the First Normal Form. It obviously does not have any duplicated fields. Does it have a primary key to uniquely identify each record? The project's name is unique within the table, but again we must ask, is it possible for two projects to have the same name? It is conceivable that the Pear Creek cleanup might be an annual event, and might be regarded as a new project each year, with the same name as the previous year's project. To be safe, it is a good idea to introduce a surrogate key to the
Project table to serve as the primary key:
|1||Solar Scholars||Powering schools with solar panels|
|2||Pear Creek cleanup||Cleaning up litter and pollutants from Pear Creek|
|3||Danube Land Trust||Purchasing and preserving land in the Danube watershed|
|4||Forest Asia||Planting trees in Asia|