Understanding Relational Databases: Introduction
This section of the site features articles published between 2002 and 2004. They remain here for reference purposes and may contain information that is out of date.
Technical level: Intermediate || Date: 19th January 2003 || Author: John Blank, Wankyu Choi, Allan Kent, Ganesh Prasad, Chris Ullman
This is a free sample chapter from Beginning PHP 4 Databases published by Wrox Press.
As we have briefly seen in the previous chapter, there are many possible ways to structure data, such as trees, lists, and objects. In relational databases, data is structured as relations. Relations are expressed as sets of tuples, which are implemented in the form of tables. Conceptually, tables are easy to grasp, since it is a form that is familiar to most people. Anyone who has read a spreadsheet, a train timetable, or even a television guide is already familiar with the organization of data into columns and rows. In this chapter, we will lay out the basic concepts of relational databases and describe the process of organizing data in a relational manner. The topics covered in this chapter are:
- Schema normalization
- The process by which redundancy and inconsistency are eliminated from the database
- Fields or combinations of fields that are used to identify records
- Referential integrity
- A state of consistency for database schemata
- Entity relationship diagrams
- Models used to design databases
As mentioned earlier, a table is a data structure that uses columns and rows to organize the data. As an example, consider the following ledger sheet sample of a charity containing details of the donations:
|Donor||Donation 1||Donation 2||Donation 3|
|Marco Pinelli||$200 Solar
|Victor Gomez||$100 Pear Creek||$100 Danube
|$50 Forest Asia|
|Seung Yong Lee||$150 Forest Asia|
Tables represent entities, which are unique items like people, objects, and relationships about which we wish to store data. Each row represents an instance of the entity. In the above example, each row represents an instance of one donor. In relational database terminology, an instance is known as a record, but the terms row or tuple are also used.
Each column represents an attribute of the entity, or something about the entity. In this case, each column represents a donation made by the donor, listing the amount of the donation and the project to which the money is donated. In relational database terminology, an attribute is known as a field, but the term column is also very common. Adding or removing columns would change both the data stored in the table and the actual structure of the table, whereas adding or removing rows would only change data stored in the table. In other words, removing a column removes information about entities whereas removing a row only removes one instance of an entity but no information about them in general.
As we shall see in the next chapter, each field in a table is assigned a data type. The type indicates what sort of data will be stored in that field: text data, integer data, boolean (true or false) data, and so on. The assigned type then applies to that field's value for every record in the table.
We create databases because we need to store information. For the information in the database to be useful, we need to be able to perform certain operations on it. These operations fall broadly into two categories: reading the data, and changing the data. Whether one wishes to read a record, update it, or delete it, one first needs to identify the record in a way that distinguishes it from the other records in the table.
This is where keys come in. A key is a field or a combination of fields whose value identifies a record for a given purpose. One type of key is a unique key, which can be used to identify a single record. For example, every book has a unique ISBN (International Standard Book Number) that marks the book unmistakably. If a table of information about books includes an ISBN field, then that field can serve as a unique key.
A table might have more than one unique key. Suppose that each book in our table also has a unique product ID. While there is no problem with the existence of more than one unique key, it is considered desirable to have one that stands out as the primary key - the key that is considered the foremost means of identifying a record. In this case each of the unique keys is known as a candidate key, since each has the possibility of serving as the primary key. It is then up to the database designer to designate the primary key from among the candidate keys.
In our example from the previous section, the
Donor field is a candidate key, if we accept (for now) that each donor is unique within the table. We shall revisit the topic of keys later in this chapter.
A Few Inadequacies
At first glance, a simple table such as the one shown in the chapter seems to meet all of our needs for storing data. When designing and filling the table, we may add as many fields and records as we like to accommodate large amounts of data. But after some examination, we are likely to encounter quite a few failings with our table. What if a donor makes more than three donations? We can add more fields to the table, of course, but to change the structure of a database once it is in use is extremely inconvenient. Also it is difficult to know in advance how many donations would be enough. What if one donor makes dozens of donations?
What if we wish to store more information about a project, such as a description? Or even more information about a donation, such as the date? Again, while it is conceivable that additional columns could address this issue, such a solution would be awkward and wasteful. If columns named
DonationDate2 are added, the same uncertainty over the appropriate number of columns exists. Adding a description after every project name produces a lot of redundant data, since each project appears in the table multiple times. Every time a new donation is made, the description of the project would have to be repeated. Such redundancy is very inefficient as seen in the following table:
|Marco Pinelli||$200||Solar Scholars||Powering schools with solar panels|
|Victor Gomez||$100||Pear Creek cleanup||Cleaning up litter and pollutants from Pear Creek||$100||Danube Land Trust||Purchasing and preserving land in the Danube watershed|
|Seung Yong Lee||$150||Forest Asia||Planting trees in Asia|
The underlying problem is that a table is two-dimensional. It consists of columns and rows. Real-world data is usually multi-dimensional. We wish to store not only the data relevant to the donors and donations, but also data that relates to details in the table, such as additional information about the projects. There is a solution to our problem. Relational databases allow us to create multiple tables of related data. The database designer uses the relationships between these tables to represent multi-dimensional data. This is also why they are called relational databases. Let's now look at the process of normalization in relational databases.