Adjusting parameters for good query performance

Writing good queries is the first step toward achieving good performance; without well-written queries, you will most likely suffer from bad performance. Therefore, writing good and intelligent code gives you the greatest edge possible. Once your queries have been optimized from a logical and semantic point of view, good memory settings can provide a final, nice speedup.

In this section, we will learn what more memory can do for you and how PostgreSQL can use it for your benefit. Again, this section assumes that we are using single-core queries to make the plans more readable. To ensure that there is always just one core at work, use the following command:

test=# SET max_parallel_workers_per_gather TO 0;
SET
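
If in doubt, you can verify that the setting has taken effect for the current session:

test=# SHOW max_parallel_workers_per_gather;
 max_parallel_workers_per_gather
---------------------------------
 0
(1 row)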

Here is a simple example demonstrating what memory parameters can do for you:

test=# CREATE TABLE t_test (id serial, name text);
CREATE TABLE
test=# INSERT INTO t_test (name)
       SELECT 'hans' FROM generate_series(1, 100000);
INSERT 0 100000
test=# INSERT INTO t_test (name)
       SELECT 'paul' FROM generate_series(1, 100000);
INSERT 0 100000

First, 100,000 rows containing hans are added to the table. Then, 100,000 rows containing paul are loaded. Altogether, there will be 200,000 unique IDs, but just two different names.

Let's run a simple query by using PostgreSQL's default memory settings:

test=# SELECT name, count(*) FROM t_test GROUP BY 1;
 name | count
------+--------
 hans | 100000
 paul | 100000
(2 rows)

Two rows will be returned, which should not come as a surprise. The important thing here is not the result, but what PostgreSQL is doing behind the scenes:

test=# EXPLAIN ANALYZE SELECT name, count(*)
       FROM t_test
       GROUP BY 1;
                            QUERY PLAN
-----------------------------------------------------------------
 HashAggregate  (cost=4082.00..4082.01 rows=1 width=13)
                (actual time=51.448..51.448 rows=2 loops=1)
   Group Key: name
   ->  Seq Scan on t_test
           (cost=0.00..3082.00 rows=200000 width=5)
           (actual time=0.007..14.150 rows=200000 loops=1)
 Planning time: 0.032 ms
 Execution time: 51.471 ms
(5 rows)

PostgreSQL figured out that the number of groups is actually very small. Therefore, it creates a hash, adds one hash entry per group, and starts to count. Because there are so few groups, the hash stays really small, and PostgreSQL can do the count quickly by simply incrementing the counter of the matching group.
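
Where does this knowledge come from? The planner relies on the column statistics collected by ANALYZE (or autovacuum). If you are curious, the following quick check (a sketch; the manual ANALYZE simply makes sure the statistics are up to date) shows the estimated number of distinct values per column. In the pg_stats view, a negative n_distinct means the number of distinct values is expressed as a fraction of the table size:

test=# ANALYZE t_test;
ANALYZE
test=# SELECT attname, n_distinct
       FROM pg_stats
       WHERE tablename = 't_test';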

What happens if we group by ID and not by name? The number of groups will skyrocket, as shown in the following code:

test=# EXPLAIN ANALYZE SELECT id, count(*) FROM t_test GROUP BY 1;
                            QUERY PLAN
-----------------------------------------------------------------
 GroupAggregate  (cost=23428.64..26928.64 rows=200000 width=12)
                 (actual time=97.128..154.205 rows=200000 loops=1)
   Group Key: id
   ->  Sort  (cost=23428.64..23928.64 rows=200000 width=4)
             (actual time=97.120..113.017 rows=200000 loops=1)
         Sort Key: id
         Sort Method: external sort  Disk: 2736kB
         ->  Seq Scan on t_test
                 (cost=0.00..3082.00 rows=200000 width=4)
                 (actual time=0.017..19.469 rows=200000 loops=1)
 Planning time: 0.128 ms
 Execution time: 160.589 ms
(8 rows)

PostgreSQL figures out that the number of groups is now a lot larger and quickly changes its strategy. The problem is that a hash containing so many entries doesn't fit into memory:

test=# SHOW work_mem;
 work_mem
----------
 4MB
(1 row)

As we can see, the work_mem variable governs the amount of memory available to the hash that's used by the GROUP BY clause. Since there are too many entries, PostgreSQL has to find a strategy that doesn't require holding the entire dataset in memory. The solution is to sort the data by ID and group it. Once the data has been sorted, PostgreSQL can move down the list and form one group after the other. As soon as the first value has been counted completely, its partial result is ready and can be emitted; then, the next group can be processed. Once a value in the sorted list changes while moving down, it will never show up again, so the system knows that the partial result for that group is complete.
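
The Sort Method: external sort Disk: 2736kB line in the previous plan shows exactly this spill to disk. If you want to keep an eye on how often queries have to fall back to temporary files, the pg_stat_database view tracks this per database (a quick sketch; the database name is the test database used throughout this example):

test=# SELECT datname, temp_files, temp_bytes
       FROM pg_stat_database
       WHERE datname = 'test';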

To speed up the query, a higher value for the work_mem variable can be set on the fly (and, of course, globally):

test=# SET work_mem TO '1 GB';
SET
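
Keep in mind that SET only changes the value for the current session. To make the change permanent for the entire instance, the parameter can also be written to the server configuration (a sketch; choose a value that fits your hardware, since work_mem is allocated per sort or hash operation and can therefore be consumed many times in parallel):

test=# ALTER SYSTEM SET work_mem TO '1GB';
ALTER SYSTEM
test=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)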

Now, the plan will, once again, feature a fast and efficient hash aggregate:

test=# EXPLAIN ANALYZE SELECT id, count(*) FROM t_test GROUP BY 1;
                            QUERY PLAN
-----------------------------------------------------------------
 HashAggregate  (cost=4082.00..6082.00 rows=200000 width=12)
                (actual time=76.967..118.926 rows=200000 loops=1)
   Group Key: id
   ->  Seq Scan on t_test
           (cost=0.00..3082.00 rows=200000 width=4)
           (actual time=0.008..13.570 rows=200000 loops=1)
 Planning time: 0.073 ms
 Execution time: 126.456 ms
(5 rows)

PostgreSQL knows (or at least assumes) that the data will fit into memory and switches back to the faster plan. As you can see, the execution time is lower. The query won't be as fast as in the GROUP BY name case, because many more hash entries have to be built, but you will see a nice and reliable benefit in the vast majority of cases.
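
Once you are done experimenting, the session-level setting can simply be discarded again so that the configured default applies once more (assuming the default has not been changed elsewhere, it will be back at 4MB):

test=# RESET work_mem;
RESET
test=# SHOW work_mem;
 work_mem
----------
 4MB
(1 row)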
