Fixing full-text search

After focusing on some highly important and common problems related to indexing, it is time to discuss the most important pitfalls in the area of full-text indexing.

Not using full-text search at all

One of the most common mistakes made is not to use full-text search at all. Many people tend to use LIKE instead of proper, PostgreSQL-style full-text searching. Doing that can open a Pandora's box and result in various performance issues.

To demonstrate the point, some sample text can be loaded. In this case, Shakespeare's Hamlet has been selected, which can be downloaded freely from www.gutenberg.org (an archive of free books). To load the data, end users can turn to curl, just as was shown before in the previous example. PostgreSQL will read the data from the pipe:

test=# CREATE TABLE t_text (payload text);
CREATE TABLE
test=# COPY t_text FROM PROGRAM 'curl http://www.gutenberg.org/cache/epub/2265/pg2265.txt';
COPY 5302

In this example, 5302 lines of text have been loaded. The goal is to look for all lines that contain company and find. Here is what you should not do:

test=# SELECT * FROM t_text 
  WHERE payload ILIKE '%company%' 
    AND payload ILIKE '%find%';
                  payload                   
--------------------------------------------
 What company, at what expense: and finding
(1 row)

One line has been returned. This does not look like a problem but it is. Let's assume you are not sure whether you are looking for company or companies. If you google for car, you would be happy with cars as well, wouldn't you? The same applies here. So, instead of asking PostgreSQL for a fixed string, it makes sense to process the data in the table before indexing it. The technical term here is stemming. Here is how it can be done:

test=# SELECT to_tsvector('english', 
  'What company, at what expense: and finding'),
           to_tsvector           
---------------------------------
 'compani':2 'expens':5 'find':7
(1 row)

English language rules are used to stem the text just returned. What comes out is a set of three tokens. Note that some words are missing, as they are simple stop words. There is no semantic payload in what, at, or and, so those words can safely be skipped.

The next important observation is that words are stemmed. Stemming is a process that treats words such that they are reduced to some sort of root (not really the root of the word but usually pretty similar to it). The following listing shows an example of stemming:

test=# SELECT to_tsvector('english', 
  'company, companies, find, finding'),
       to_tsvector        
--------------------------
 'compani':1,2 'find':3,4
(1 row)

As you can see, company and companies are both reduced to compani, which is not a real word but a good representation of the actual meaning. The same applies to find and finding. The important part now is that this vector of stemmed words is indexed; the word showing up in the text isn't. The beauty here is that if you are searching for car, you will also end up finding cars—a nice, little improvement!

There is one more observation; mind the numbers showing up behind the stemmed word. Those numbers represent the position of the word in the stemmed string.

Note

Keep in mind that stemming is a highly language-dependent process. Therefore, the language must be passed to the call.

To ensure good performance, it is necessary to create an index. For full-text searches, GIN indexes have proven to be the most beneficial index. Here is how it works for our copy of Hamlet:

test=# CREATE INDEX idx_gin ON t_text 
  USING gin (to_tsvector('english', payload));
CREATE INDEX
test=# SELECT * 
  FROM t_text 
  WHERE to_tsvector('english', payload) 
    @@ to_tsquery('english', 'find & company'),
                  payload                   
--------------------------------------------
 What company, at what expense: and finding
(1 row)

Once the index has been created, it can be used. The important part here is that the tsvector function created by the stemming process is compared to our search string. The goal is to find all lines containing find and company. Exactly one line is returned here. Notice that company shows up before finding in the sentence. The search string is stemmed as well, so this result is possible.

To prove the preceding point, the following list contains the execution plan showing the index scan:

test=# explain SELECT * 
  FROM   t_text 
  WHERE   to_tsvector('english', payload) 
    @@ to_tsquery('english', 'find & company'),
                QUERY PLAN   
-------------------------------------------------------
 Bitmap Heap Scan on t_text  
  (cost=20.00..24.02 rows=1 width=33)
   Recheck Cond: (to_tsvector('english'::regconfig, 
  payload) @@ '''find'' & ''compani'''::tsquery)
   ->  Bitmap Index Scan on idx_gin  
  (cost=0.00..20.00 rows=1 width=0)
         Index Cond: (to_tsvector('english'::regconfig,
     payload) @@ '''find'' & 
        ''compani'''::tsquery)
 Planning time: 0.096 ms
(5 rows)

PostgreSQL's full-text indexing provides a lot more functionality than is actually described here. However, given the scope of this book, it is not possible to cover all of this wonderful functionality.

Full-text search and sorting

Full-text search is powerful. However, there are also some corner cases you should be aware of. Consider the following type of query:

SELECT * FROM product WHERE "fti_query" ORDER BY price;

The use case for such a query can be simple: find all books containing certain words in the title, and sort them by price; or find all restaurants containing pizza in their description and sort them by postal code (in most countries, a number).

The way a GIN index is organized is perfect for full-text search. The problem is that inside GIN, the item pointers pointing to the row in the table are sorted by position, not by some criteria such as price.

So let's assume the following query:

  SELECT * 
    FROM   products 
    WHERE field @@ to_tsquery('english', 'book')
    ORDER BY price;

During my career as a PostgreSQL consultant, I came across a database that contained 4 million product descriptions with the word book. What will happen here is that PostgreSQL will have to find all of these books and sort them by price. Fetching 4 million rows from an index and having to sort 4 million rows is actually a death sentence to the performance of the query.

People might argue now, "Why not narrow down the price window?" In this case, it won't help because a price range of 20 to 50 euros would still include, say, 3 million products.

As of PostgreSQL 9.4, it is very hard to fix this type of query. Therefore, developers have to be very careful and aware of the fact that a GIN index does not necessarily provide sorted output. Therefore, the amount of data returned from this type of query is directly proportional to the runtime of the SQL request.

Future versions of PostgreSQL may solve this problem once and for all. In the meantime, however, people have to live with the fact that you cannot expect an infinite number of rows to be sorted in no time.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset