Applying bloom filters

Since PostgreSQL 9.6, it has been possible to add index types on the fly using extensions. The CREATE ACCESS METHOD command, along with some supporting infrastructure, makes it possible to create fully functional, WAL-logged index types as extensions.

The bloom extension provides PostgreSQL users with bloom filters, which are probabilistic prefilters that help us discard non-matching data as early as possible. The idea behind a bloom filter is that we can calculate a bitmask for each row and compare that bitmask with the query. A bloom filter may produce false positives, but never false negatives, so it still reduces the amount of data dramatically.
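Conceptually, the mechanism can be sketched in a few lines. The following is a minimal illustrative Python model (not the extension's actual implementation): each value sets k bits in an m-bit mask, and a lookup checks whether all of that value's bits are set. A miss is definite; a hit may be a false positive.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter sketch, for illustration only."""

    def __init__(self, m=1024, k=3):
        self.m = m          # number of bits in the mask
        self.k = k          # number of hash functions
        self.bits = 0       # the bitmask, stored as one integer

    def _positions(self, item):
        # Derive k bit positions from k salted hashes of the item.
        for i in range(self.k):
            h = hashlib.blake2b(repr(item).encode(),
                                salt=str(i).encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        # Set the item's k bits in the mask.
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely not stored";
        # True means "possibly stored" (may be a false positive).
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

An added value is always found again; only absent values can be misreported, which is exactly why the filter works as a prefilter.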

This is especially useful when a table consists of hundreds of columns and millions of rows. It isn't practical to cover arbitrary combinations of hundreds of columns with B-trees, so a bloom filter is a good alternative because it allows us to index everything at once.

To understand how things work, we will install the extension:

test=# CREATE EXTENSION bloom;
CREATE EXTENSION

Now, we need to create a table containing various columns:

test=# CREATE TABLE t_bloom
(
    id   serial,
    col1 int4 DEFAULT random() * 1000,
    col2 int4 DEFAULT random() * 1000,
    col3 int4 DEFAULT random() * 1000,
    col4 int4 DEFAULT random() * 1000,
    col5 int4 DEFAULT random() * 1000,
    col6 int4 DEFAULT random() * 1000,
    col7 int4 DEFAULT random() * 1000,
    col8 int4 DEFAULT random() * 1000,
    col9 int4 DEFAULT random() * 1000
);
CREATE TABLE

To make this easier, these columns have a default value so that data can easily be added using a simple SELECT clause:

test=# INSERT INTO t_bloom (id) 
SELECT * FROM generate_series(1, 1000000);

INSERT 0 1000000

The preceding query adds 1 million rows to the table. Now, the table can be indexed:

test=# CREATE INDEX idx_bloom ON t_bloom 
USING bloom(col1, col2, col3, col4, col5, col6, col7, col8, col9);
CREATE INDEX

Note that the index covers all nine columns at once. In contrast to a B-tree, the order of those columns doesn't make a difference.

Note that the table we just created is around 65 MB without indexes; the index adds another 15 MB to the storage footprint:

test=# \di+ idx_bloom
                         List of relations
 Schema |   Name    | Type  | Owner |  Table  | Size  | Description
--------+-----------+-------+-------+---------+-------+-------------
 public | idx_bloom | index | hs    | t_bloom | 15 MB |
(1 row)

The beauty of the bloom filter is that it is possible to look for any combination of columns. To keep the execution plan easy to read, parallel query is disabled first:

test=# SET max_parallel_workers_per_gather TO 0;
SET

test=# explain SELECT count(*)
       FROM t_bloom
       WHERE col4 = 454 AND col3 = 354 AND col9 = 423;
                              QUERY PLAN
-----------------------------------------------------------------------
 Aggregate  (cost=20352.02..20352.03 rows=1 width=8)
   ->  Bitmap Heap Scan on t_bloom
         (cost=20348.00..20352.02 rows=1 width=0)
         Recheck Cond: ((col3 = 354) AND (col4 = 454) AND (col9 = 423))
         ->  Bitmap Index Scan on idx_bloom
               (cost=0.00..20348.00 rows=1 width=0)
               Index Cond: ((col3 = 354) AND (col4 = 454) AND (col9 = 423))
(5 rows)

What you have seen so far feels exceptional, so a natural question arises: Why not always use a bloom filter? The reason is simple: to use the filter, the database has to read it in its entirety. With a B-tree, by contrast, only a narrow path from the root to a leaf page has to be read.
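The false positive trade-off can also be quantified. The textbook approximation for a bloom filter with m bits, k hash functions, and n stored elements is p ≈ (1 − e^(−kn/m))^k. This is generic bloom filter math, not a statement about the extension's exact on-disk sizing:

```python
import math


def false_positive_rate(n, m, k):
    """Approximate false positive probability of a bloom filter.

    n: number of elements stored
    m: number of bits in the filter
    k: number of hash functions
    """
    return (1.0 - math.exp(-k * n / m)) ** k


# Storing more elements in the same filter raises the error rate,
# while giving the filter more bits lowers it.
```

As the formula shows, the rate grows with n and shrinks with m, which is why a bloom index on a large table still has to be sized generously to stay selective.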

In the future, more index types will likely be added to ensure that even more use cases can be covered with PostgreSQL.

If you want to read more about bloom filters, consider reading our blog post at https://www.cybertec-postgresql.com/en/trying-out-postgres-bloom-indexes/.