Using indexes in an intelligent way

So far, applying an index feels like the Holy Grail, which always improves performance magically. However, this is not the case. Indexes can also be pretty pointless in some cases.

Before digging into things more deeply, here is the data structure we have used for this example. Remember that there are only two distinct names and unique IDs:

test=# d t_test 
Table "public.t_test"
Column | Type | Modifiers
--------+---------+------------------------------------
id | integer | not null default nextval('t_test_id_seq'::regclass)
name | text |
Indexes:
"idx_id" btree (id)

At this point, one index has been defined, which covers the id column. In the next step, the name column will be queried. Before doing this, an index on the name will be created:

test=# CREATE INDEX idx_name ON t_test (name);  
CREATE INDEX 

Now, it is time to see if the index is used correctly:

test=# EXPLAIN SELECT * FROM t_test WHERE name = 'hans2'; 
QUERY PLAN
-----------------------------------------------------
Index Scan using idx_name on t_test
(cost=0.43..4.45 rows=1 width=9)
Index Cond: (name = 'hans2'::text)
(2 rows)

As expected, PostgreSQL will decide on using the index. Most users would expect this. But note that my query says hans2. Remember, hans2 does not exist in the table and the query plan perfectly reflects this. rows=1 indicates that the planner only expects a very small subset of data being returned by the query.

There is not a single row in the table, but PostgreSQL will never estimate zero rows because it would make subsequent estimations a lot harder, as useful cost calculations of other nodes in the plan would be close to impossible.

Let's see what happens if we look for more data:

test=# EXPLAIN SELECT * 
FROM t_test
WHERE name = 'hans'
OR name = 'paul';
QUERY PLAN
---------------------------------------------------------------
Seq Scan on t_test (cost=0.00..81622.00 rows=3000011 width=9)
Filter: ((name = 'hans'::text) OR (name = 'paul'::text))
(2 rows)

In this case, PostgreSQL will go for a straight sequential scan. Why is that? Why is the system ignoring all indexes? The reason is simple: hans and paul make up the entire dataset because there are no other values (PostgreSQL knows that by checking the system statistics). Therefore, PostgreSQL figures that the entire table has to be read anyway. There is no reason to read all of the index and the full table if reading the table is sufficient.

In other words, PostgreSQL will not use an index just because there is one. PostgreSQL will use indexes when they make sense. If the number of rows is smaller, PostgreSQL will again consider bitmap scans and normal index scans:

test=# EXPLAIN SELECT * 
FROM t_test
WHERE name = 'hans2'
OR name = 'paul2';

QUERY PLAN
---------------------------------------------------------------------
Bitmap Heap Scan on t_test (cost=8.88..12.89 rows=1 width=9)
Recheck Cond: ((name = 'hans2'::text) OR (name = 'paul2'::text))
-> BitmapOr (cost=8.88..8.88 rows=1 width=0)
-> Bitmap Index Scan on idx_name
(cost=0.00..4.44 rows=1 width=0)

Index Cond: (name = 'hans2'::text)
-> Bitmap Index Scan on idx_name
(cost=0.00..4.44 rows=1 width=0)

Index Cond: (name = 'paul2'::text)

The most important point to learn here is that execution plans depend on input values.

They are not static and not independent of the data inside the table. This is a very important observation, which has to be kept in mind all the time. In real-world examples, the fact that plans change can often be the reason for unpredictable runtimes.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset