When starting with databases, one of the first things people learn is data normalization. Normalization makes a lot of sense because it reduces redundancy and ensures that data is structured in a clean, consistent way.
However, normalization can easily be taken too far; when it is, performance suffers and things become needlessly complex. Here is an example of normalization taken too far: a few years ago, while I was talking to students at a local university, a professor confronted me with an interesting case. He told me that he uses the following example to show students how easily normalization can be overdone. He invented the 7th normal form (which, of course, does not exist in the professional database literature).
Here is an example:
test=# CREATE TABLE t_test (name text);
CREATE TABLE
test=# INSERT INTO t_test VALUES ('sealevel');
INSERT 0 1
There is still some redundancy in sealevel: some letters show up more than once, so we can normalize that too:
test=# CREATE TABLE t_char (id int, char char(1));
CREATE TABLE
test=# INSERT INTO t_char VALUES (1, 's'), (2, 'e'), (3, 'a'), (4, 'l'), (5, 'v');
INSERT 0 5
In this table, one entry is made for each letter. In addition to this, a second table can be created to finally form the word:
test=# CREATE TABLE t_word (order_id int, char_id int);
CREATE TABLE
The word sealevel can therefore be represented as the following set of rows:
test=# INSERT INTO t_word VALUES (1, 1), (2, 2), (3, 3), (4, 4),
           (5, 2), (6, 5), (7, 2), (8, 4);
INSERT 0 8
Joining entries together will reveal the word:
test=# SELECT order_id, char
       FROM t_word AS a, t_char AS b
       WHERE a.char_id = b.id
       ORDER BY order_id;
 order_id | char
----------+------
        1 | s
        2 | e
        3 | a
        4 | l
        5 | e
        6 | v
        7 | e
        8 | l
(8 rows)
What is the point of all this? The point is that you can normalize your data beyond recognition. But does it really make sense? In the example just outlined, everybody will agree that this is clearly useless and way beyond what is justified. Therefore, it is highly recommended to ask yourself whether it is really worth introducing additional relations, especially when there is no chance of ever changing them independently.
To avoid excessive normalization, you can turn your attention to arrays. Arrays are a simple and, in many cases, good way to reduce the number of relations, especially when you don't need hard referential integrity.
What is the main benefit of an array and of avoiding joins? Keep in mind that when using a join, you have to read data from two tables. This often means additional I/O, and most importantly, you have to join the data, which leads to sorting, hashing, merging, and so on.
If data is stored inline in an array, there is no need to read from two tables and join, merge, and so on. You may be able to enjoy the simplicity of a sequential scan or the speed of a single-index lookup to retrieve the data needed.
To show how arrays can be used to reduce normalization, the following example may be beneficial to you. The goal is to come up with a data structure for storing which people have been to which countries:
test=# CREATE TABLE t_country (id int, country_code char(2));
CREATE TABLE
Then some data can be added to the table:
test=# INSERT INTO t_country VALUES (1, 'DE'), (2, 'AT'), (3, 'BE'), (4, 'NL');
INSERT 0 4
In the next step, a table linking people and countries together is created:
test=# CREATE TABLE t_travel (person text, country_id int);
CREATE TABLE
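To make the comparison concrete, here is how the link table might be filled and queried; the sample names are, of course, made up:

```sql
INSERT INTO t_travel VALUES ('hans', 1), ('hans', 2), ('paul', 4);

-- retrieving the countries a person has visited requires a join
SELECT person, country_code
FROM t_travel AS a, t_country AS b
WHERE a.country_id = b.id
  AND person = 'hans';
```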
Basically, this is the standard way of describing such a relationship. However, the same information can also be stored inline, using an array of country codes instead of a link table (the previous t_travel table is dropped first):

test=# DROP TABLE t_travel;
DROP TABLE
test=# CREATE TABLE t_travel (person text, country_list char(2)[]);
CREATE TABLE
It is very unlikely that the name of a country will change, and even if it does, you may not care too much. Imagine that somebody has been to the Soviet Union. Even after its breakup, you would still not change the name of the country you have visited in retrospect. Therefore, integrity is really not an issue here.
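With the array-based layout, adding and reading a person's travel history needs no join at all. A small sketch (the names are made up; `= ANY` checks whether a value is contained in the array):

```sql
INSERT INTO t_travel VALUES ('hans', ARRAY['DE', 'AT']);

-- no join needed: the full list of countries comes back as a single value
SELECT country_list FROM t_travel WHERE person = 'hans';

-- who has been to Austria?
SELECT person FROM t_travel WHERE 'AT' = ANY (country_list);
```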
The beauty is that you can save on all the expenses that might be caused by the join. In fact, in large-scale systems, random I/O can be a major bottleneck, which should be avoided. By keeping data close together, you can avoid a ton of problems.
In fact, too many joins on too much data can be a real source of trouble in a production system: as soon as your server hits the random I/O limits imposed by your disks, performance will drop dramatically.
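If the array column is searched frequently, PostgreSQL can even index it: a GIN index supports the array containment operator (`@>`), so membership queries can use an index lookup instead of a sequential scan. A sketch, assuming the array-based t_travel table from above (the index name is arbitrary):

```sql
CREATE INDEX idx_travel_countries ON t_travel USING gin (country_list);

-- this predicate can make use of the GIN index
SELECT person
FROM t_travel
WHERE country_list @> ARRAY['AT'::char(2)];
```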