Day 3: Full Text and Multidimensions

We’ll spend Day 3 investigating the many tools at our disposal to build a movie query system. We’ll begin with the many ways PostgreSQL can search actor/movie names using fuzzy string matching. Then we’ll discover the cube package by creating a movie suggestion system based on similar genres of movies we already like. Because these are all contributed packages, the implementations are special to PostgreSQL and not part of the SQL standard.

Often, when designing a relational database schema, you’ll start with an entity diagram. We’ll be writing a personal movie suggestion system that keeps track of movies, their genres, and their actors, as modeled in the figure.

Before we begin the Day 3 exercises, we’ll need to extend Postgres by installing the following contributed packages: tablefunc, dict_xsyn, fuzzystrmatch, pg_trgm, and cube. You can refer to the website for installation instructions.[9]

images/postgres-movies.png

Run the following command and check that it matches the output below to ensure your contrib packages have been installed correctly:

 $ ​​psql​​ ​​7dbs​​ ​​-c​​ ​​"SELECT '1'::cube;"
 cube
 ------
 (1)
 (1 row)

Seek out the online docs for more information if you receive an error message.

Let’s first build the database. It’s often good practice to create indexes on foreign keys to speed up reverse lookups (such as what movies this actor is involved in). You should also set a UNIQUE constraint on join tables like movies_actors to avoid duplicate join values.

 CREATE​ ​TABLE​ genres (
 name​ ​text​ ​UNIQUE​,
  position ​integer
 );
 
 CREATE​ ​TABLE​ movies (
  movie_id ​SERIAL​ ​PRIMARY​ ​KEY​,
  title ​text​,
  genre ​cube
 );
 
 CREATE​ ​TABLE​ actors (
  actor_id ​SERIAL​ ​PRIMARY​ ​KEY​,
 name​ ​text
 );
 
 CREATE​ ​TABLE​ movies_actors (
  movie_id ​integer​ ​REFERENCES​ movies ​NOT​ ​NULL​,
  actor_id ​integer​ ​REFERENCES​ actors ​NOT​ ​NULL​,
 UNIQUE​ (movie_id, actor_id)
 );
 
 CREATE​ ​INDEX​ movies_actors_movie_id ​ON​ movies_actors (movie_id);
 CREATE​ ​INDEX​ movies_actors_actor_id ​ON​ movies_actors (actor_id);
 CREATE​ ​INDEX​ movies_genres_cube ​ON​ movies ​USING​ gist (genre);

You can download the movies_data.sql file as a file alongside the book and populate the tables by piping the file into the database. Any questions you may have about the genre cube will be covered later today.

Fuzzy Searching

Opening up a system to text searches means opening your system to inaccurate inputs. You have to expect typos like “Brid of Frankstein.” Sometimes, users can’t remember the full name of “J. Roberts.” In other cases, we just plain don’t know how to spell “Benn Aflek.“ We’ll look into a few PostgreSQL packages that make text searching easy.

It’s worth noting that as we progress, this kind of string matching blurs the lines between relational queries and searching frameworks such as Lucene[10] and Elasticsearch.[11] Although some may feel that features like full-text search belong with the application code, there can be performance and administrative benefits of pushing these packages to the database, where the data lives.

SQL Standard String Matches

PostgreSQL has many ways of performing text matches but the two major default methods are LIKE and regular expressions.

I Like LIKE and ILIKE

LIKE and ILIKE are the simplest forms of text search (ILIKE is a case-insensitive version of LIKE). They are fairly universal in relational databases. LIKE compares column values against a given pattern string. The % and _ characters are wildcards: % matches any number of any characters while _ matches exactly one character.

 SELECT​ title ​FROM​ movies ​WHERE​ title ILIKE ​'stardust%'​;
 
  title
 -------------------
  Stardust
  Stardust Memories

If we want to be sure the substring stardust is not at the end of the string, we can use the underscore (_) character as a little trick.

 SELECT​ title ​FROM​ movies ​WHERE​ title ILIKE ​'stardust_%'​;
 
  title
 -------------------
  Stardust Memories

This is useful in basic cases, but LIKE is limited to simple wildcards.

Regex

A more powerful string-matching syntax is a regular expression (regex). Regexes appear often throughout this book because many databases support them. There are entire books dedicated to writing powerful expressions—the topic is far too wide and complex to cover in depth here. Postgres conforms (mostly) to the POSIX style.

In Postgres, a regular expression match is led by the ~ operator, with the optional ! (meaning not matching) and * (meaning case insensitive). To count all movies that do not begin with the, the following case-insensitive query will work. The characters inside the string are the regular expression.

 SELECT​ COUNT(*) ​FROM​ movies ​WHERE​ title !~* ​'^the.*'​;

You can index strings for pattern matching the previous queries by creating a text_pattern_ops operator class index, as long as the values are indexed in lowercase.

 CREATE​ ​INDEX​ movies_title_pattern ​ON​ movies (lower(title) text_pattern_ops);

We used the text_pattern_ops because the title is of type text. If you need to index varchars, chars, or names, use the related ops: varchar_pattern_ops, bpchar_pattern_ops, and name_pattern_ops.

Bride of Levenshtein

Levenshtein is a string comparison algorithm that compares how similar two strings are by how many steps are required to change one string into another. Each replaced, missing, or added character counts as a step. The distance is the total number of steps away. In PostgreSQL, the levenshtein function is provided by the fuzzystrmatch contrib package. Say we have the string bat and the string fads.

 SELECT​ levenshtein(​'bat'​, ​'fads'​);

The Levenshtein distance is 3 because in order to go from bat to fads, we replaced two letters (b=>f, t=>d) and added a letter (+s). Each change increments the distance. We can watch the distance close as we step closer (so to speak). The total goes down until we get zero (the two strings are equal).

 SELECT​ levenshtein(​'bat'​, ​'fad'​) fad,
  levenshtein(​'bat'​, ​'fat'​) fat,
  levenshtein(​'bat'​, ​'bat'​) bat;
 
  fad | fat | bat
 -----+-----+-----
  2 | 1 | 0

Changes in case cost a point, too, so you may find it best to convert all strings to the same case when querying.

 SELECT​ movie_id, title ​FROM​ movies
 WHERE​ levenshtein(lower(title), lower(​'a hard day nght'​)) <= 3;
 
  movie_id | title
 ----------+--------------------
  245 | A Hard ​Day​​'s Night

This ensures minor differences won’t over-inflate the distance.

Try a Trigram

A trigram is a group of three consecutive characters taken from a string. The pg_trgm contrib module breaks a string into as many trigrams as it can.

 SELECT​ show_trgm(​'Avatar'​);
 
  show_trgm
 -------------------------------------
 {​" a"," av","ar ",ata,ava,tar,vat​}

Finding a matching string is as simple as counting the number of matching trigrams. The strings with the most matches are the most similar. It’s useful for doing a search where you’re okay with either slight misspellings or even minor words missing. The longer the string, the more trigrams and the more likely a match—they’re great for something like movie titles because they have relatively similar lengths. We’ll create a trigram index against movie names to start, using Generalized Index Search Tree (GIST), a generic index API made available by the PostgreSQL engine.

 CREATE​ ​INDEX​ movies_title_trigram ​ON​ movies
 USING​ gist (title gist_trgm_ops);

Now you can query with a few misspellings and still get decent results.

 SELECT​ title
 FROM​ movies
 WHERE​ title % ​'Avatre'​;
 
  title
 ---------
  Avatar

Trigrams are an excellent choice for accepting user input without weighing queries down with wildcard complexity.

Full-Text Fun

Next, we want to allow users to perform full-text searches based on matching words, even if they’re pluralized. If a user wants to search for certain words in a movie title but can remember only some of them, Postgres supports simple natural language processing.

TSVector and TSQuery

Let’s look for a movie that contains the words night and day. This is a perfect job for text search using the @@ full-text query operator.

 SELECT​ title
 FROM​ movies
 WHERE​ title @@ ​'night & day'​;
 
  title
 -------------------------------
  A Hard ​Day​​'s Night
  Six Days Seven Nights
  Long Day'​s Journey ​Into​ Night

The query returns titles like A Hard Day’s Night, despite the word Day being in possessive form and the fact that the two words are out of order in the query. The @@ operator converts the name field into a tsvector and converts the query into a tsquery.

A tsvector is a datatype that splits a string into an array (or a vector) of tokens, which are searched against the given query, while the tsquery represents a query in some language, like English or French. The language corresponds to a dictionary (which we’ll see more of in a few paragraphs). The previous query is equivalent to the following (if your system language is set to English):

 SELECT​ title
 FROM​ movies
 WHERE​ to_tsvector(title) @@ to_tsquery(​'english'​, ​'night & day'​);

You can take a look at how the vector and the query break apart the values by running the conversion functions on the strings outright.

 SELECT​ to_tsvector(​'A Hard Day​​''​​s Night'​),
  to_tsquery(​'english'​, ​'night & day'​);
 
  to_tsvector | to_tsquery
 ---------------------------+-----------------
 'day'​:3 ​'hard'​:2 ​'night'​:5 | ​'night'​ & ​'day'

The tokens on a tsvector are called lexemes and are coupled with their positions in the given phrase.

You may have noticed the tsvector for A Hard Day’s Night did not contain the lexeme a. Moreover, simple English words such as a are missing if you try to query by them.

 SELECT​ *
 FROM​ movies
 WHERE​ title @@ to_tsquery(​'english'​, ​'a'​);
 
 NOTICE: ​text​-search ​query​ ​contains​ only ​stop​ words ​or​ doesn​'t ​​
  contain lexemes, ignored

Common words such as a are called stop words and are generally not useful for performing queries. The English dictionary was used by the parser to normalize our string into useful English components. In your console, you can view the output of the stop words under the English tsearch_data directory.

 $ ​​cat​​ ​​`pg_config​​ ​​--sharedir`/tsearch_data/english.stop

We could remove a from the list, or we could use another dictionary like simple that just breaks up strings by nonword characters and makes them lowercase. Compare these two vectors:

 SELECT​ to_tsvector(​'english'​, ​'A Hard Day​​''​​s Night'​);
 
  to_tsvector
 ----------------------------
 'day'​:3 ​'hard'​:2 ​'night'​:5
 
 SELECT​ to_tsvector(​'simple'​, ​'A Hard Day​​''​​s Night'​);
 
  to_tsvector
 ----------------------------------------
 'a'​:1 ​'day'​:3 ​'hard'​:2 ​'night'​:5 ​'s'​:4

With simple, you can retrieve any movie containing the lexeme a.

Other Languages

Because Postgres is doing some natural language processing here, it only makes sense that different configurations would be used for different languages. All of the installed configurations can be viewed with this command:

 7dbs=# ​​dF

Dictionaries are part of what Postgres uses to generate tsvector lexemes (along with stop words and other tokenizing rules we haven’t covered called parsers and templates). You can view your system’s list here:

 7dbs=# ​​dFd

You can test any dictionary outright by calling the ts_lexize function. Here we find the English stem word of the string Day’s.

 SELECT​ ts_lexize(​'english_stem'​, ​'Day​​''​​s'​);
 
  ts_lexize
 -----------
 {​​day​​}

Finally, the previous full-text commands work for other languages, too. If you have German installed, try this:

 SELECT​ to_tsvector(​'german'​, ​'was machst du gerade?'​);
 
  to_tsvector
 --------------------
 'gerad'​:4 ​'mach'​:2

Because was (what) and du (you) are common, they are marked as stop words in the German dictionary, while machst (doing) and gerade (at the moment) are stemmed.

Indexing Lexemes

Full-text search is powerful. But if we don’t index our tables, it’s also slow. The EXPLAIN command is a powerful tool for digging into how queries are internally planned.

 EXPLAIN
 SELECT​ *
 FROM​ movies
 WHERE​ title @@ ​'night & day'​;
 
 QUERY​ PLAN
 ---------------------------------------------------------------------------
  Seq Scan ​on​ movies (cost=0.00..815.86 ​rows​=3 width=171)
  Filter: (title @@ ​'night & day'​::​text​)

Note the line Seq Scan on movies. That’s rarely a good sign in a query because it means a whole table scan is taking place; each row will be read. That usually means that you need to create an index.

We’ll use Generalized Inverted iNdex (GIN)—like GIST, it’s an index API—to create an index of lexeme values we can query against. The term inverted index may sound familiar to you if you’ve ever used a search engine like Lucene or Sphinx. It’s a common data structure to index full-text searches.

 CREATE​ ​INDEX​ movies_title_searchable ​ON​ movies
 USING​ gin(to_tsvector(​'english'​, title));

With our index in place, let’s try to search again.

 EXPLAIN
 SELECT​ *
 FROM​ movies
 WHERE​ title @@ ​'night & day'​;
 
 QUERY​ PLAN
 ---------------------------------------------------------------------------
  Seq Scan ​on​ movies (cost=0.00..815.86 ​rows​=3 width=171)
  Filter: (title @@ ​'night & day'​::​text​)

What happened? Nothing. The index is there, but Postgres isn’t using it because our GIN index specifically uses the english configuration for building its tsvectors, but we aren’t specifying that vector. We need to specify it in the WHERE clause of the query.

 EXPLAIN
 SELECT​ *
 FROM​ movies
 WHERE​ to_tsvector(​'english'​,title) @@ ​'night & day'​;

That will return this query plan:

 QUERY​ PLAN
 ------------------------------------------------------------------
  Bitmap Heap Scan ​on​ movies (cost=20.00..24.26 ​rows​=1 width=171)
  Recheck Cond: (to_tsvector(​'english'​::regconfig, title) @@
 '​​''​​night​​''​​ & ​​''​​day​​''​​'​::tsquery)
  -> Bitmap ​Index​ Scan ​on​ movies_title_searchable
  (cost=0.00..20.00 ​rows​=1 width=0)
 Index​ Cond: (to_tsvector(​'english'​::regconfig, title) @@
 '​​''​​night​​''​​ & ​​''​​day​​''​​'​::tsquery)

EXPLAIN is important to ensure indexes are used as you expect them. Otherwise, the index is just wasted overhead.

Metaphones

We’ve inched toward matching less specific inputs. LIKE and regular expressions require crafting patterns that can match strings precisely according to their format. Levenshtein distance allows you to find matches that contain minor misspellings but must ultimately be very close to the same string. Trigrams are a good choice for finding reasonable misspelled matches. Finally, full-text searching allows natural language flexibility in that it can ignore minor words such as a and the and can deal with pluralization. Sometimes we just don’t know how to spell words correctly but we know how they sound.

We love Bruce Willis and would love to see what movies he’s in. Unfortunately, we can’t remember exactly how to spell his name, so we sound it out as best we can.

 SELECT​ *
 FROM​ actors
 WHERE​ ​name​ = ​'Broos Wils'​;

Even a trigram is no good here (using % rather than =).

 SELECT​ *
 FROM​ actors
 WHERE​ ​name​ % ​'Broos Wils'​;

Enter the metaphones, which are algorithms for creating a string representation of word sounds. You can define how many characters are in the output string. For example, the seven-character metaphone of the name Aaron Eckhart is ARNKHRT.

To find all films with actors with names sounding like Broos Wils, we can query against the metaphone output. Note that NATURAL JOIN is an INNER JOIN that automatically joins ON matching column names (for example, movies.actor_id=movies_actors.actor_id).

 SELECT​ title
 FROM​ movies ​NATURAL​ ​JOIN​ movies_actors ​NATURAL​ ​JOIN​ actors
 WHERE​ metaphone(​name​, 6) = metaphone(​'Broos Wils'​, 6);
 
  title
 -----------------------------
  The Fifth Element
  Twelve Monkeys
  Armageddon
  Die Hard
  Pulp Fiction
  The Sixth Sense
 :

If you peek at the online documentation, you’ll see the fuzzystrmatch module contains other functions: dmetaphone (double metaphone), dmetaphone_alt (for alternative name pronunciations), and soundex (a really old algorithm from the 1880s made by the U.S. Census to compare common American surnames).

You can dissect the functions’ representations by selecting their output.

 SELECT​ ​name​, dmetaphone(​name​), dmetaphone_alt(​name​),
  metaphone(​name​, 8), soundex(​name​)
 FROM​ actors;
 
 name​ | dmetaphone | dmetaphone_alt | metaphone | soundex
 ----------------+------------+----------------+-----------+--------
  50 Cent | SNT | SNT | SNT | C530
  Aaron Eckhart | ARNK | ARNK | ARNKHRT | A652
  Agatha Hurle | AK0R | AKTR | AK0HRL | A236

There is no single best function to choose, and the optimal choice depends on your dataset and use case.

Combining String Matches

With all of our string searching ducks in a row, we’re ready to start combining them in interesting ways.

One of the most flexible aspects of metaphones is that their outputs are just strings. This allows you to mix and match with other string matchers.

For example, we could use the trigram operator against metaphone outputs and then order the results by the lowest Levenshtein distance. This means “Get me names that sound the most like Robin Williams, in order.”

 SELECT​ * ​FROM​ actors
 WHERE​ metaphone(​name​,8) % metaphone(​'Robin Williams'​,8)
 ORDER​ ​BY​ levenshtein(lower(​'Robin Williams'​), lower(​name​));
 
  actor_id | ​name
 ----------+-----------------
  4093 | Robin Williams
  2442 | John Williams
  4479 | Steven Williams
  4090 | Robin Shou

Be warned, though, that unbridled exploitation of this flexibility can yield funny results.

 SELECT​ * ​FROM​ actors ​WHERE​ dmetaphone(​name​) % dmetaphone(​'Ron'​);

This will return a result set that includes actors like Renée Zellweger, Ringo Starr, and Randy Quaid.

The combinations are vast, limited only by your experimentations.

Genres as a Multidimensional Hypercube

The last contributed package we investigate is cube. We’ll use the cube datatype to map a movie’s genres as a multidimensional vector. We will then use methods to efficiently query for the closest points within the boundary of a hypercube to give us a list of similar movies.

As you may have noticed in the beginning of Day 3, we created a column named genres of type cube. Each value is a point in 18-dimensional space with each dimension representing a genre. Why represent movie genres as points in n-dimensional space? Movie categorization is not an exact science, and many movies are not 100 percent comedy or 100 percent tragedy—they are something in between.

In our system, each genre is scored from (the totally arbitrary numbers) 0 to 10 based on how strong the movie is within that genre—with 0 being nonexistent and 10 being the strongest.

Star Wars, for example, has a genre vector of (0,7,0,0,0,0,0,0,0,7,0,0,0,0,10,0,0,0). The genres table describes the position of each dimension in the vector. We can decrypt its genre values by extracting the cube_ur_coord(vector,dimension) using each genres.position. For clarity, we filter out genres with scores of 0.

 SELECT​ ​name​,
  cube_ur_coord(​'(0,7,0,0,0,0,0,0,0,7,0,0,0,0,10,0,0,0)'​, position) ​as​ score
 FROM​ genres g
 WHERE​ cube_ur_coord(​'(0,7,0,0,0,0,0,0,0,7,0,0,0,0,10,0,0,0)'​, position) > 0;
 
 name​ | score
 -----------+-------
  Adventure | 7
  Fantasy | 7
  SciFi | 10

We will find similar movies by finding the nearest points. To understand why this works, we can envision two movies on a two-dimensional genre graph, like the graph shown below. If your favorite movie is Animal House, you’ll probably want to see The 40-Year-Old Virgin more than Oedipus—a story famously lacking in comedy. In our two-dimensional universe, it’s a simple nearest-neighbor search to find likely matches.

images/postgres-2d-genres.png

We can extrapolate this into more dimensions with more genres, be it 2, 3, or 18. The principle is the same: a nearest-neighbor match to the nearest points in genre space will yield the closest genre matches.

The nearest matches to the genre vector can be discovered by the cube_distance(point1, point2). Here we can find the distance of all movies to the Star Wars genre vector, nearest first.

 SELECT​ *,
  cube_distance(genre, ​'(0,7,0,0,0,0,0,0,0,7,0,0,0,0,10,0,0,0)'​) dist
 FROM​ movies
 ORDER​ ​BY​ dist;

We created the movies_genres_cube cube index earlier when we created the tables. However, even with an index, this query is still relatively slow because it requires a full-table scan. It computes the distance on every row and then sorts them.

Rather than compute the distance of every point, we can instead focus on likely points by way of a bounding cube. Just like finding the closest five towns on a map will be faster on a state map than a world map, bounding reduces the points we need to look at.

We use cube_enlarge(cube,radius,dimensions) to build an 18-dimensional cube that is some length (radius) wider than a point.

Let’s view a simpler example. If we built a two-dimensional square one unit around a point (1,1), the lower-left point of the square would be at (0,0), and the upper-right point would be (2,2).

 SELECT​ cube_enlarge(​'(1,1)'​,1,2);
 
  cube_enlarge
 ---------------
  (0, 0),(2, 2)

The same principle applies in any number of dimensions. With our bounding hypercube, we can use a special cube operator, @>, which means contains. This query finds the distance of all points contained within a five-unit cube of the Star Wars genre point.

 SELECT​ title,
  cube_distance(genre, ​'(0,7,0,0,0,0,0,0,0,7,0,0,0,0,10,0,0,0)'​) dist
 FROM​ movies
 WHERE​ cube_enlarge(​'(0,7,0,0,0,0,0,0,0,7,0,0,0,0,10,0,0,0)'​::​cube​, 5, 18)
  @> genre
 ORDER​ ​BY​ dist;
 
  title | dist
 ------------------------------------------------+------------------
  Star Wars | 0
  Star Wars: Episode V - The Empire Strikes Back | 2
  Avatar | 5
  Explorers | 5.74456264653803
  Krull | 6.48074069840786
  E.T. The Extra-Terrestrial | 7.61577310586391

Using a subselect, we can get the genre by movie name and perform our calculations against that genre using a table alias.

 SELECT​ m.movie_id, m.title
 FROM​ movies m, (​SELECT​ genre, title ​FROM​ movies ​WHERE​ title = ​'Mad Max'​) s
 WHERE​ cube_enlarge(s.genre, 5, 18) @> m.genre ​AND​ s.title <> m.title
 ORDER​ ​BY​ cube_distance(m.genre, s.genre)
 LIMIT​ 10;
 
  movie_id | title
 ----------+----------------------------
  1405 | Cyborg
  1391 | ​Escape​ ​from​ L.A.
  1192 | Mad Max Beyond Thunderdome
  1189 | Universal Soldier
  1222 | Soldier
  1362 | Johnny Mnemonic
  946 | Alive
  418 | ​Escape​ ​from​ ​New​ York
  1877 | The ​Last​ Starfighter
  1445 | The Rocketeer

This method of movie suggestion is not perfect, but it’s an excellent start. We will see more dimensional queries in later chapters, such as two-dimensional geographic searches in MongoDB (see GeoSpatial Queries).

Day 3 Wrap-Up

Today we jumped headlong into PostgreSQL’s flexibility in performing string searches and used the cube package for multidimensional searching. Most importantly, we caught a glimpse of the nonstandard extensions that put PostgreSQL at the top of the open source RDBMS field. There are dozens (if not hundreds) more extensions at your disposal, from geographic storage to cryptographic functions, custom datatypes, and language extensions. Beyond the core power of SQL, contrib packages make PostgreSQL shine.

Day 3 Homework

Find

  1. Find the online documentation listing all contributed packages bundled into Postgres. Read up on two that you could imagine yourself using in one of your projects.

  2. Find the online POSIX regex documentation (it will also come in handy in future chapters).

Do

  1. Create a stored procedure that enables you to input a movie title or an actor’s name and then receive the top five suggestions based on either movies the actor has starred in or films with similar genres.

  2. Expand the movies database to track user comments and extract keywords (minus English stopwords). Cross-reference these keywords with actors’ last names and try to find the most talked-about actors.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset