Index selectivity

Let's take a look at the account_history table in the car web portal example. The unique constraint, UNIQUE (account_id, search_key, search_date), serves two purposes. The first is to ensure that a search key is inserted into the table only once per day, even if the user searches for the same key several times. The second is to retrieve data quickly. Let's assume that we would like to show the last 10 searches for a user. The query for performing this would be as follows:

SELECT search_key FROM account_history WHERE account_id = <account> GROUP BY search_key ORDER BY max(search_date) limit 10;

The preceding query returns at most 10 records containing the distinct search_key values, ordered by the most recent search_date. If the account_history table contained millions of rows, reading all of the data would take a lot of time. That is not the case here, because the unique index allows PostgreSQL to read only the rows belonging to this particular account.
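
The exact definition of account_history depends on the rest of the car web portal schema, but a minimal sketch consistent with the constraint discussed above could look like the following (the surrogate key and the column types here are assumptions for illustration):

CREATE TABLE account_history (
    account_history_id SERIAL PRIMARY KEY,  -- assumed surrogate key
    account_id         INT  NOT NULL REFERENCES account (account_id),
    search_key         TEXT NOT NULL,
    search_date        DATE NOT NULL,
    -- one row per account, key, and day; also backs index-based retrieval
    UNIQUE (account_id, search_key, search_date)
);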

Indexes on tables are not used if the table size is small; the PostgreSQL planner will instead scan the complete table. To show this, the account_history table was populated with a very small dataset, generated as follows:

WITH test_account AS (
    INSERT INTO account VALUES (1000, 'test_first_name', 'test_last_name', '[email protected]', 'password')
    RETURNING account_id
), car AS (
    SELECT i AS car_model FROM (VALUES ('brand=BMW'), ('brand=WV')) AS foo(i)
), manufacturing_date AS (
    SELECT 'year=' || i AS date FROM generate_series(2015, 2014, -1) AS foo(i)
)
INSERT INTO account_history (account_id, search_key, search_date)
SELECT account_id, car.car_model || '&' || manufacturing_date.date, current_date
FROM test_account, car, manufacturing_date;
VACUUM ANALYZE;
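
A UNIQUE constraint is enforced internally through a unique index, so account_history already has an index on (account_id, search_key, search_date). To confirm that it exists, you can query the pg_indexes catalog view; the automatically generated index name also appears in the execution plans later in this section:

car_portal=> SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'account_history';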

To test whether the index is used, let's run the query as follows:

car_portal=> SELECT search_key FROM account_history WHERE account_id = 1000 GROUP BY search_key ORDER BY max(search_date) limit 10;
search_key
---------------------
brand=WV&year=2014
brand=BMW&year=2014
brand=WV&year=2015
brand=BMW&year=2015
(4 rows)

car_portal=> EXPLAIN SELECT search_key FROM account_history WHERE account_id = 1000 GROUP BY search_key ORDER BY max(search_date) limit 10;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Limit  (cost=1.17..1.18 rows=3 width=23)
   ->  Sort  (cost=1.17..1.18 rows=3 width=23)
         Sort Key: (max(search_date))
         ->  HashAggregate  (cost=1.12..1.15 rows=3 width=23)
               Group Key: search_key
               ->  Seq Scan on account_history  (cost=0.00..1.10 rows=4 width=23)
                     Filter: (account_id = 1000)
(7 rows)

The index in the preceding example is not used: the PostgreSQL planner decides whether to use an index based on the estimated cost of each execution plan. For the same query with different parameters, the planner might pick a different plan based on the data histogram. So even with a big dataset, if the predicate in the query does not filter out most of the rows, the index will not be used. To create such a scenario, let's insert some entries for another account and rerun the query, as follows:

WITH test_account AS (
    INSERT INTO account VALUES (2000, 'test_first_name', 'test_last_name', '[email protected]', 'password')
    RETURNING account_id
), car AS (
    SELECT i AS car_model FROM (VALUES ('brand=BMW'), ('brand=WV'), ('brand=Audi'), ('brand=MB')) AS foo(i)
), manufacturing_date AS (
    SELECT 'year=' || i AS date FROM generate_series(2017, 1900, -1) AS foo(i)
)
INSERT INTO account_history (account_id, search_key, search_date)
SELECT account_id, car.car_model || '&' || manufacturing_date.date, current_date
FROM test_account, car, manufacturing_date;
VACUUM ANALYZE;

Now, let's run the query for the second account:

car_portal=> EXPLAIN SELECT search_key FROM account_history WHERE account_id = 2000 GROUP BY search_key ORDER BY max(search_date) limit 10;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Limit  (cost=27.10..27.13 rows=10 width=23)
   ->  Sort  (cost=27.10..28.27 rows=468 width=23)
         Sort Key: (max(search_date))
         ->  HashAggregate  (cost=12.31..16.99 rows=468 width=23)
               Group Key: search_key
               ->  Seq Scan on account_history  (cost=0.00..9.95 rows=472 width=23)
                     Filter: (account_id = 2000)
(7 rows)

Again, the index is not used, because the query has to read most of the table anyway. The index selectivity here is very low, since account 2000 has 472 records, whereas account 1000 has only four, as shown below:

car_portal=> SELECT count(*), account_id FROM account_history group by account_id;
count | account_id
-------+------------
472 | 2000
4 | 1000
(2 rows)
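
The selectivity estimates the planner works with come from the per-column statistics collected by ANALYZE, which is why the examples above run VACUUM ANALYZE after loading data. As a quick check, the distribution of account_id can be inspected through the pg_stats view, as follows:

car_portal=> SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'account_history' AND attname = 'account_id';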

Finally, let's rerun the query for account 1000. In this case, the selectivity is high, so the index will be used:

car_portal=> EXPLAIN SELECT search_key FROM account_history WHERE account_id = 1000 GROUP BY search_key ORDER BY max(search_date) limit 10;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8.71..8.72 rows=4 width=23)
   ->  Sort  (cost=8.71..8.72 rows=4 width=23)
         Sort Key: (max(search_date))
         ->  GroupAggregate  (cost=8.60..8.67 rows=4 width=23)
               Group Key: search_key
               ->  Sort  (cost=8.60..8.61 rows=4 width=23)
                     Sort Key: search_key
                     ->  Bitmap Heap Scan on account_history  (cost=4.30..8.56 rows=4 width=23)
                           Recheck Cond: (account_id = 1000)
                           ->  Bitmap Index Scan on account_history_account_id_search_key_search_date_key  (cost=0.00..4.30 rows=4 width=0)
                                 Index Cond: (account_id = 1000)
(11 rows)
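
To verify that the planner's choice is purely cost-based, you can temporarily discourage sequential scans in the current session and rerun EXPLAIN for account 2000; the planner will then typically fall back to the index, but at a higher estimated cost. This is a diagnostic technique only and is not recommended for production code:

car_portal=> SET enable_seqscan = off;
car_portal=> EXPLAIN SELECT search_key FROM account_history WHERE account_id = 2000 GROUP BY search_key ORDER BY max(search_date) limit 10;
car_portal=> RESET enable_seqscan;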