Indexes are not always the solution to a problem; sometimes they are the problem themselves. The following example outlines a common pitfall that should be avoided at all costs:
test=# CREATE TABLE t_test (id int, x text);
CREATE TABLE
test=# INSERT INTO t_test
         SELECT x, 'house'
         FROM generate_series(1, 10000000) AS x;
INSERT 0 10000000
test=# CREATE INDEX idx_x ON t_test (x);
CREATE INDEX
Before taking a look at the way the index is used, it makes sense to inspect the size of the table as well as the size of the indexes:
test=# SELECT pg_size_pretty(pg_relation_size('t_test')),
              pg_size_pretty(pg_relation_size('idx_x'));
 pg_size_pretty | pg_size_pretty
----------------+----------------
 422 MB         | 214 MB
(1 row)
The table created in this example is 422 MB in size (the table alone, without the index). On top of that, the index takes up another 214 MB, so overall, the table plus the index occupy more than 600 MB.
The problem is that the index is of no use in this example:
test=# explain SELECT * FROM t_test WHERE x = 'house';
                             QUERY PLAN
-------------------------------------------------------------------
 Seq Scan on t_test  (cost=0.00..179054.03 rows=10000000 width=10)
   Filter: (x = 'house'::text)
(2 rows)
Despite the index, PostgreSQL still uses a sequential scan. The reason is simple: all values in the table are identical, so the column is not selective enough, and the index is therefore more or less useless. Why doesn't PostgreSQL use the index at all? Let's think about it for a moment. The goal of an index is to reduce I/O. If every row in the table matches the filter, the entire table has to be read anyway; using the index would mean reading all the index entries on top of the table. Reading the index and the entire table is a lot more expensive than just reading the table, so PostgreSQL decides on the cheaper strategy.
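The planner bases this decision on per-column statistics collected by ANALYZE. As a small sketch (assuming the t_test table from above and a recent ANALYZE run), the relevant numbers can be inspected through the pg_stats system view:

```sql
-- Refresh the optimizer statistics for t_test
ANALYZE t_test;

-- Look at what the planner knows about column x
SELECT n_distinct,        -- estimated number of distinct values
       most_common_vals,  -- the most frequent values sampled
       most_common_freqs  -- their estimated frequencies (0..1)
FROM   pg_stats
WHERE  tablename = 't_test'
       AND attname = 'x';
```

With a single distinct value ('house') occurring at a frequency close to 1, the planner can correctly conclude that an index scan on x cannot save any I/O.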
However, the index is not always useless:
test=# explain SELECT * FROM t_test WHERE x = 'y';
                             QUERY PLAN
----------------------------------------------------------------------
 Index Scan using idx_x on t_test  (cost=0.43..4.45 rows=1 width=10)
   Index Cond: (x = 'y'::text)
 Planning time: 0.096 ms
(3 rows)
In this example, we select a value that is not present in the table (or is expected to be pretty rare). In this case, the index is chosen. In other words, selectivity is the key to the usefulness of an index: if a column is not selective, an index on it is pointless.
In this chapter, the importance of indexing has already been demonstrated. However, what is the best way to figure out whether indexes are missing? The pg_stat_user_tables view is a system view containing all the relevant information. Here is a query that will definitely be beneficial to you:
SELECT schemaname,
       relname,
       seq_scan,
       seq_tup_read,
       idx_scan,
       seq_tup_read / seq_scan
FROM   pg_stat_user_tables
WHERE  seq_scan > 0
ORDER BY seq_tup_read DESC;
The query provides us with schemaname and relname to identify the table inside the database. The seq_scan column returns the number of times the table has been read sequentially; during those sequential scans, the database had to read seq_tup_read rows. The idx_scan column informs us about the number of index scans, and finally, the average number of rows read per sequential scan is displayed.
The beauty of this query is that those tables that may need an index will show up on top. How do you know? A table that is read sequentially over and over again and that contains a large number of rows is definitely a candidate.
It obviously makes sense to go through the top candidates and check each table. Keep in mind that PostgreSQL can tell you which tables may be a problem, but it won't tell you which columns have to be indexed. Some basic knowledge about the application is needed; otherwise, you will end up doing guesswork. It is really necessary to figure out which columns your application filters on, and at this point, there is no automatic algorithm to check that.
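One practical way to see which columns your queries actually filter on is to look at the most frequently executed statements themselves. A sketch using the pg_stat_statements extension (it must be listed in shared_preload_libraries before the extension can be created; note that the timing column is called total_time in versions before PostgreSQL 13):

```sql
-- Requires shared_preload_libraries = 'pg_stat_statements' and a restart
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Show the most frequently executed statements; their WHERE clauses
-- reveal which columns are candidates for indexing
SELECT calls,
       total_exec_time,  -- total_time before PostgreSQL 13
       query
FROM   pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
```

Reading the WHERE clauses of the top statements replaces a good deal of the guesswork mentioned above.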
Some of you might wonder why too many indexes are bad. While reading data, indexes are no problem at all in 99 percent of all cases (unless the planner makes a bad decision or some caches are too small). However, when it comes to insertion, indexes are a major performance bottleneck.
To compare things, a simple test can be created: one million rows are added to the indexed table from before, as well as to a fresh table without an index. To compare runtimes, \timing is used:
test=# \timing
Timing is on.
The beauty of \timing is that the time needed to execute a query is displayed right at the end of the command. First, 1 million rows are added to the indexed table:
test=# INSERT INTO t_test SELECT * FROM generate_series(1, 1000000);
INSERT 0 1000000
Time: 6346.756 ms
It takes around 6 seconds to do this. What if the data is added to an empty table without an index?
test=# CREATE TABLE t_fast (id int, x text);
CREATE TABLE
Time: 92.034 ms
test=# INSERT INTO t_fast SELECT * FROM generate_series(1, 1000000);
INSERT 0 1000000
Time: 2078.413 ms
Doing the same thing without an index is around three times faster. Keep in mind that the main bottleneck in this example is actually the generation of the data itself; if generate_series were free, the difference would be a lot larger. Every index that does not yield a benefit is a destructive index: it needs space on disk and, more importantly, it slows down writes dramatically. If your application is write-bound, additional indexes can be a killer. However, in my professional career, I have seen that write-bound applications are really a minority. Therefore, overindexing can be as dangerous as underindexing. Underindexing is just the more obvious problem, because you will instantly see that certain queries are slow; the performance problems caused by too many indexes are usually a bit more subtle.
To prevent problems related to too many indexes, it makes sense to check for useless indexes periodically. PostgreSQL has all the onboard means to do that. The best place to start is a system view called pg_stat_user_indexes, which tells you which indexes are used and how often:
test=# \d pg_stat_user_indexes
View "pg_catalog.pg_stat_user_indexes"
    Column     |  Type  | Modifiers
---------------+--------+-----------
 relid         | oid    |
 indexrelid    | oid    |
 schemaname    | name   |
 relname       | name   |
 indexrelname  | name   |
 idx_scan      | bigint |
 idx_tup_read  | bigint |
 idx_tup_fetch | bigint |
The relevant column here is idx_scan. It tells us how often a certain index has been utilized. If an index is rarely used, it might make sense to drop it.
Dropping indexes can free up a lot of space, reduce I/O, and boost writes.
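A periodic check can be sketched with a query like the following (the view and column names are as shown above; the DROP INDEX statement is purely illustrative and should only be run after verifying that the index is truly unneeded, since idx_scan can also be zero because the statistics were reset recently, or because the index enforces a unique constraint rather than speeding up reads):

```sql
-- List indexes that have never been scanned, largest first
SELECT schemaname,
       relname,
       indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

-- After double-checking (statistics age, standbys, unique constraints):
-- DROP INDEX idx_x;
```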