We’ll spend Day 3 investigating the many tools at our disposal to build a movie query system. We’ll begin with the many ways PostgreSQL can search actor/movie names using fuzzy string matching. Then we’ll discover the cube package by creating a movie suggestion system based on similar genres of movies we already like. Because these are all contributed packages, the implementations are special to PostgreSQL and not part of the SQL standard.
Often, when designing a relational database schema, you’ll start with an entity diagram. We’ll be writing a personal movie suggestion system that keeps track of movies, their genres, and their actors, as modeled in the figure.
Before we begin the Day 3 exercises, we’ll need to extend Postgres by installing the following contributed packages: tablefunc, dict_xsyn, fuzzystrmatch, pg_trgm, and cube. You can refer to the website for installation instructions.[9]
Run the following command and check that it matches the output below to ensure your contrib packages have been installed correctly:
| $ psql 7dbs -c "SELECT '1'::cube;" |
| cube |
| ------ |
| (1) |
| (1 row) |
Seek out the online docs for more information if you receive an error message.
Let’s first build the database. It’s often good practice to create indexes on foreign keys to speed up reverse lookups (such as what movies this actor is involved in). You should also set a UNIQUE constraint on join tables like movies_actors to avoid duplicate join values.
| CREATE TABLE genres ( |
| name text UNIQUE, |
| position integer |
| ); |
| |
| CREATE TABLE movies ( |
| movie_id SERIAL PRIMARY KEY, |
| title text, |
| genre cube |
| ); |
| |
| CREATE TABLE actors ( |
| actor_id SERIAL PRIMARY KEY, |
| name text |
| ); |
| |
| CREATE TABLE movies_actors ( |
| movie_id integer REFERENCES movies NOT NULL, |
| actor_id integer REFERENCES actors NOT NULL, |
| UNIQUE (movie_id, actor_id) |
| ); |
| |
| CREATE INDEX movies_actors_movie_id ON movies_actors (movie_id); |
| CREATE INDEX movies_actors_actor_id ON movies_actors (actor_id); |
| CREATE INDEX movies_genres_cube ON movies USING gist (genre); |
You can download the movies_data.sql file as a file alongside the book and populate the tables by piping the file into the database. Any questions you may have about the genre cube will be covered later today.
Opening up a system to text searches means opening your system to inaccurate inputs. You have to expect typos like “Brid of Frankstein.” Sometimes, users can’t remember the full name of “J. Roberts.” In other cases, we just plain don’t know how to spell “Benn Aflek.“ We’ll look into a few PostgreSQL packages that make text searching easy.
It’s worth noting that as we progress, this kind of string matching blurs the lines between relational queries and searching frameworks such as Lucene[10] and Elasticsearch.[11] Although some may feel that features like full-text search belong with the application code, there can be performance and administrative benefits of pushing these packages to the database, where the data lives.
PostgreSQL has many ways of performing text matches but the two major default methods are LIKE and regular expressions.
I Like LIKE and ILIKE
LIKE and ILIKE are the simplest forms of text search (ILIKE is a case-insensitive version of LIKE). They are fairly universal in relational databases. LIKE compares column values against a given pattern string. The % and _ characters are wildcards: % matches any number of any characters while _ matches exactly one character.
| SELECT title FROM movies WHERE title ILIKE 'stardust%'; |
| |
| title |
| ------------------- |
| Stardust |
| Stardust Memories |
If we want to be sure the substring stardust is not at the end of the string, we can use the underscore (_) character as a little trick.
| SELECT title FROM movies WHERE title ILIKE 'stardust_%'; |
| |
| title |
| ------------------- |
| Stardust Memories |
This is useful in basic cases, but LIKE is limited to simple wildcards.
Regex
A more powerful string-matching syntax is a regular expression (regex). Regexes appear often throughout this book because many databases support them. There are entire books dedicated to writing powerful expressions—the topic is far too wide and complex to cover in depth here. Postgres conforms (mostly) to the POSIX style.
In Postgres, a regular expression match is led by the ~ operator, with the optional ! (meaning not matching) and * (meaning case insensitive). To count all movies that do not begin with the, the following case-insensitive query will work. The characters inside the string are the regular expression.
| SELECT COUNT(*) FROM movies WHERE title !~* '^the.*'; |
You can index strings for pattern matching the previous queries by creating a text_pattern_ops operator class index, as long as the values are indexed in lowercase.
| CREATE INDEX movies_title_pattern ON movies (lower(title) text_pattern_ops); |
We used the text_pattern_ops because the title is of type text. If you need to index varchars, chars, or names, use the related ops: varchar_pattern_ops, bpchar_pattern_ops, and name_pattern_ops.
Levenshtein is a string comparison algorithm that compares how similar two strings are by how many steps are required to change one string into another. Each replaced, missing, or added character counts as a step. The distance is the total number of steps away. In PostgreSQL, the levenshtein function is provided by the fuzzystrmatch contrib package. Say we have the string bat and the string fads.
| SELECT levenshtein('bat', 'fads'); |
The Levenshtein distance is 3 because in order to go from bat to fads, we replaced two letters (b=>f, t=>d) and added a letter (+s). Each change increments the distance. We can watch the distance close as we step closer (so to speak). The total goes down until we get zero (the two strings are equal).
| SELECT levenshtein('bat', 'fad') fad, |
| levenshtein('bat', 'fat') fat, |
| levenshtein('bat', 'bat') bat; |
| |
| fad | fat | bat |
| -----+-----+----- |
| 2 | 1 | 0 |
Changes in case cost a point, too, so you may find it best to convert all strings to the same case when querying.
| SELECT movie_id, title FROM movies |
| WHERE levenshtein(lower(title), lower('a hard day nght')) <= 3; |
| |
| movie_id | title |
| ----------+-------------------- |
| 245 | A Hard Day's Night |
This ensures minor differences won’t over-inflate the distance.
A trigram is a group of three consecutive characters taken from a string. The pg_trgm contrib module breaks a string into as many trigrams as it can.
| SELECT show_trgm('Avatar'); |
| |
| show_trgm |
| ------------------------------------- |
| {" a"," av","ar ",ata,ava,tar,vat} |
Finding a matching string is as simple as counting the number of matching trigrams. The strings with the most matches are the most similar. It’s useful for doing a search where you’re okay with either slight misspellings or even minor words missing. The longer the string, the more trigrams and the more likely a match—they’re great for something like movie titles because they have relatively similar lengths. We’ll create a trigram index against movie names to start, using Generalized Index Search Tree (GIST), a generic index API made available by the PostgreSQL engine.
| CREATE INDEX movies_title_trigram ON movies |
| USING gist (title gist_trgm_ops); |
Now you can query with a few misspellings and still get decent results.
| SELECT title |
| FROM movies |
| WHERE title % 'Avatre'; |
| |
| title |
| --------- |
| Avatar |
Trigrams are an excellent choice for accepting user input without weighing queries down with wildcard complexity.
Next, we want to allow users to perform full-text searches based on matching words, even if they’re pluralized. If a user wants to search for certain words in a movie title but can remember only some of them, Postgres supports simple natural language processing.
Let’s look for a movie that contains the words night and day. This is a perfect job for text search using the @@ full-text query operator.
| SELECT title |
| FROM movies |
| WHERE title @@ 'night & day'; |
| |
| title |
| ------------------------------- |
| A Hard Day's Night |
| Six Days Seven Nights |
| Long Day's Journey Into Night |
The query returns titles like A Hard Day’s Night, despite the word Day being in possessive form and the fact that the two words are out of order in the query. The @@ operator converts the name field into a tsvector and converts the query into a tsquery.
A tsvector is a datatype that splits a string into an array (or a vector) of tokens, which are searched against the given query, while the tsquery represents a query in some language, like English or French. The language corresponds to a dictionary (which we’ll see more of in a few paragraphs). The previous query is equivalent to the following (if your system language is set to English):
| SELECT title |
| FROM movies |
| WHERE to_tsvector(title) @@ to_tsquery('english', 'night & day'); |
You can take a look at how the vector and the query break apart the values by running the conversion functions on the strings outright.
| SELECT to_tsvector('A Hard Day''s Night'), |
| to_tsquery('english', 'night & day'); |
| |
| to_tsvector | to_tsquery |
| ---------------------------+----------------- |
| 'day':3 'hard':2 'night':5 | 'night' & 'day' |
The tokens on a tsvector are called lexemes and are coupled with their positions in the given phrase.
You may have noticed the tsvector for A Hard Day’s Night did not contain the lexeme a. Moreover, simple English words such as a are missing if you try to query by them.
| SELECT * |
| FROM movies |
| WHERE title @@ to_tsquery('english', 'a'); |
| |
| NOTICE: text-search query contains only stop words or doesn't |
| contain lexemes, ignored |
Common words such as a are called stop words and are generally not useful for performing queries. The English dictionary was used by the parser to normalize our string into useful English components. In your console, you can view the output of the stop words under the English tsearch_data directory.
| $ cat `pg_config --sharedir`/tsearch_data/english.stop |
We could remove a from the list, or we could use another dictionary like simple that just breaks up strings by nonword characters and makes them lowercase. Compare these two vectors:
| SELECT to_tsvector('english', 'A Hard Day''s Night'); |
| |
| to_tsvector |
| ---------------------------- |
| 'day':3 'hard':2 'night':5 |
| |
| SELECT to_tsvector('simple', 'A Hard Day''s Night'); |
| |
| to_tsvector |
| ---------------------------------------- |
| 'a':1 'day':3 'hard':2 'night':5 's':4 |
With simple, you can retrieve any movie containing the lexeme a.
Because Postgres is doing some natural language processing here, it only makes sense that different configurations would be used for different languages. All of the installed configurations can be viewed with this command:
| 7dbs=# dF |
Dictionaries are part of what Postgres uses to generate tsvector lexemes (along with stop words and other tokenizing rules we haven’t covered called parsers and templates). You can view your system’s list here:
| 7dbs=# dFd |
You can test any dictionary outright by calling the ts_lexize function. Here we find the English stem word of the string Day’s.
| SELECT ts_lexize('english_stem', 'Day''s'); |
| |
| ts_lexize |
| ----------- |
| {day} |
Finally, the previous full-text commands work for other languages, too. If you have German installed, try this:
| SELECT to_tsvector('german', 'was machst du gerade?'); |
| |
| to_tsvector |
| -------------------- |
| 'gerad':4 'mach':2 |
Because was (what) and du (you) are common, they are marked as stop words in the German dictionary, while machst (doing) and gerade (at the moment) are stemmed.
Full-text search is powerful. But if we don’t index our tables, it’s also slow. The EXPLAIN command is a powerful tool for digging into how queries are internally planned.
| EXPLAIN |
| SELECT * |
| FROM movies |
| WHERE title @@ 'night & day'; |
| |
| QUERY PLAN |
| --------------------------------------------------------------------------- |
| Seq Scan on movies (cost=0.00..815.86 rows=3 width=171) |
| Filter: (title @@ 'night & day'::text) |
Note the line Seq Scan on movies. That’s rarely a good sign in a query because it means a whole table scan is taking place; each row will be read. That usually means that you need to create an index.
We’ll use Generalized Inverted iNdex (GIN)—like GIST, it’s an index API—to create an index of lexeme values we can query against. The term inverted index may sound familiar to you if you’ve ever used a search engine like Lucene or Sphinx. It’s a common data structure to index full-text searches.
| CREATE INDEX movies_title_searchable ON movies |
| USING gin(to_tsvector('english', title)); |
With our index in place, let’s try to search again.
| EXPLAIN |
| SELECT * |
| FROM movies |
| WHERE title @@ 'night & day'; |
| |
| QUERY PLAN |
| --------------------------------------------------------------------------- |
| Seq Scan on movies (cost=0.00..815.86 rows=3 width=171) |
| Filter: (title @@ 'night & day'::text) |
What happened? Nothing. The index is there, but Postgres isn’t using it because our GIN index specifically uses the english configuration for building its tsvectors, but we aren’t specifying that vector. We need to specify it in the WHERE clause of the query.
| EXPLAIN |
| SELECT * |
| FROM movies |
| WHERE to_tsvector('english',title) @@ 'night & day'; |
That will return this query plan:
| QUERY PLAN |
| ------------------------------------------------------------------ |
| Bitmap Heap Scan on movies (cost=20.00..24.26 rows=1 width=171) |
| Recheck Cond: (to_tsvector('english'::regconfig, title) @@ |
| '''night'' & ''day'''::tsquery) |
| -> Bitmap Index Scan on movies_title_searchable |
| (cost=0.00..20.00 rows=1 width=0) |
| Index Cond: (to_tsvector('english'::regconfig, title) @@ |
| '''night'' & ''day'''::tsquery) |
EXPLAIN is important to ensure indexes are used as you expect them. Otherwise, the index is just wasted overhead.
We’ve inched toward matching less specific inputs. LIKE and regular expressions require crafting patterns that can match strings precisely according to their format. Levenshtein distance allows you to find matches that contain minor misspellings but must ultimately be very close to the same string. Trigrams are a good choice for finding reasonable misspelled matches. Finally, full-text searching allows natural language flexibility in that it can ignore minor words such as a and the and can deal with pluralization. Sometimes we just don’t know how to spell words correctly but we know how they sound.
We love Bruce Willis and would love to see what movies he’s in. Unfortunately, we can’t remember exactly how to spell his name, so we sound it out as best we can.
| SELECT * |
| FROM actors |
| WHERE name = 'Broos Wils'; |
Even a trigram is no good here (using % rather than =).
| SELECT * |
| FROM actors |
| WHERE name % 'Broos Wils'; |
Enter the metaphones, which are algorithms for creating a string representation of word sounds. You can define how many characters are in the output string. For example, the seven-character metaphone of the name Aaron Eckhart is ARNKHRT.
To find all films with actors with names sounding like Broos Wils, we can query against the metaphone output. Note that NATURAL JOIN is an INNER JOIN that automatically joins ON matching column names (for example, movies.actor_id=movies_actors.actor_id).
| SELECT title |
| FROM movies NATURAL JOIN movies_actors NATURAL JOIN actors |
| WHERE metaphone(name, 6) = metaphone('Broos Wils', 6); |
| |
| title |
| ----------------------------- |
| The Fifth Element |
| Twelve Monkeys |
| Armageddon |
| Die Hard |
| Pulp Fiction |
| The Sixth Sense |
| : |
If you peek at the online documentation, you’ll see the fuzzystrmatch module contains other functions: dmetaphone (double metaphone), dmetaphone_alt (for alternative name pronunciations), and soundex (a really old algorithm from the 1880s made by the U.S. Census to compare common American surnames).
You can dissect the functions’ representations by selecting their output.
| SELECT name, dmetaphone(name), dmetaphone_alt(name), |
| metaphone(name, 8), soundex(name) |
| FROM actors; |
| |
| name | dmetaphone | dmetaphone_alt | metaphone | soundex |
| ----------------+------------+----------------+-----------+-------- |
| 50 Cent | SNT | SNT | SNT | C530 |
| Aaron Eckhart | ARNK | ARNK | ARNKHRT | A652 |
| Agatha Hurle | AK0R | AKTR | AK0HRL | A236 |
There is no single best function to choose, and the optimal choice depends on your dataset and use case.
With all of our string searching ducks in a row, we’re ready to start combining them in interesting ways.
One of the most flexible aspects of metaphones is that their outputs are just strings. This allows you to mix and match with other string matchers.
For example, we could use the trigram operator against metaphone outputs and then order the results by the lowest Levenshtein distance. This means “Get me names that sound the most like Robin Williams, in order.”
| SELECT * FROM actors |
| WHERE metaphone(name,8) % metaphone('Robin Williams',8) |
| ORDER BY levenshtein(lower('Robin Williams'), lower(name)); |
| |
| actor_id | name |
| ----------+----------------- |
| 4093 | Robin Williams |
| 2442 | John Williams |
| 4479 | Steven Williams |
| 4090 | Robin Shou |
Be warned, though, that unbridled exploitation of this flexibility can yield funny results.
| SELECT * FROM actors WHERE dmetaphone(name) % dmetaphone('Ron'); |
This will return a result set that includes actors like Renée Zellweger, Ringo Starr, and Randy Quaid.
The combinations are vast, limited only by your experimentations.
The last contributed package we investigate is cube. We’ll use the cube datatype to map a movie’s genres as a multidimensional vector. We will then use methods to efficiently query for the closest points within the boundary of a hypercube to give us a list of similar movies.
As you may have noticed in the beginning of Day 3, we created a column named genres of type cube. Each value is a point in 18-dimensional space with each dimension representing a genre. Why represent movie genres as points in n-dimensional space? Movie categorization is not an exact science, and many movies are not 100 percent comedy or 100 percent tragedy—they are something in between.
In our system, each genre is scored from (the totally arbitrary numbers) 0 to 10 based on how strong the movie is within that genre—with 0 being nonexistent and 10 being the strongest.
Star Wars, for example, has a genre vector of (0,7,0,0,0,0,0,0,0,7,0,0,0,0,10,0,0,0). The genres table describes the position of each dimension in the vector. We can decrypt its genre values by extracting the cube_ur_coord(vector,dimension) using each genres.position. For clarity, we filter out genres with scores of 0.
| SELECT name, |
| cube_ur_coord('(0,7,0,0,0,0,0,0,0,7,0,0,0,0,10,0,0,0)', position) as score |
| FROM genres g |
| WHERE cube_ur_coord('(0,7,0,0,0,0,0,0,0,7,0,0,0,0,10,0,0,0)', position) > 0; |
| |
| name | score |
| -----------+------- |
| Adventure | 7 |
| Fantasy | 7 |
| SciFi | 10 |
We will find similar movies by finding the nearest points. To understand why this works, we can envision two movies on a two-dimensional genre graph, like the graph shown below. If your favorite movie is Animal House, you’ll probably want to see The 40-Year-Old Virgin more than Oedipus—a story famously lacking in comedy. In our two-dimensional universe, it’s a simple nearest-neighbor search to find likely matches.
We can extrapolate this into more dimensions with more genres, be it 2, 3, or 18. The principle is the same: a nearest-neighbor match to the nearest points in genre space will yield the closest genre matches.
The nearest matches to the genre vector can be discovered by the cube_distance(point1, point2). Here we can find the distance of all movies to the Star Wars genre vector, nearest first.
| SELECT *, |
| cube_distance(genre, '(0,7,0,0,0,0,0,0,0,7,0,0,0,0,10,0,0,0)') dist |
| FROM movies |
| ORDER BY dist; |
We created the movies_genres_cube cube index earlier when we created the tables. However, even with an index, this query is still relatively slow because it requires a full-table scan. It computes the distance on every row and then sorts them.
Rather than compute the distance of every point, we can instead focus on likely points by way of a bounding cube. Just like finding the closest five towns on a map will be faster on a state map than a world map, bounding reduces the points we need to look at.
We use cube_enlarge(cube,radius,dimensions) to build an 18-dimensional cube that is some length (radius) wider than a point.
Let’s view a simpler example. If we built a two-dimensional square one unit around a point (1,1), the lower-left point of the square would be at (0,0), and the upper-right point would be (2,2).
| SELECT cube_enlarge('(1,1)',1,2); |
| |
| cube_enlarge |
| --------------- |
| (0, 0),(2, 2) |
The same principle applies in any number of dimensions. With our bounding hypercube, we can use a special cube operator, @>, which means contains. This query finds the distance of all points contained within a five-unit cube of the Star Wars genre point.
| SELECT title, |
| cube_distance(genre, '(0,7,0,0,0,0,0,0,0,7,0,0,0,0,10,0,0,0)') dist |
| FROM movies |
| WHERE cube_enlarge('(0,7,0,0,0,0,0,0,0,7,0,0,0,0,10,0,0,0)'::cube, 5, 18) |
| @> genre |
| ORDER BY dist; |
| |
| title | dist |
| ------------------------------------------------+------------------ |
| Star Wars | 0 |
| Star Wars: Episode V - The Empire Strikes Back | 2 |
| Avatar | 5 |
| Explorers | 5.74456264653803 |
| Krull | 6.48074069840786 |
| E.T. The Extra-Terrestrial | 7.61577310586391 |
Using a subselect, we can get the genre by movie name and perform our calculations against that genre using a table alias.
| SELECT m.movie_id, m.title |
| FROM movies m, (SELECT genre, title FROM movies WHERE title = 'Mad Max') s |
| WHERE cube_enlarge(s.genre, 5, 18) @> m.genre AND s.title <> m.title |
| ORDER BY cube_distance(m.genre, s.genre) |
| LIMIT 10; |
| |
| movie_id | title |
| ----------+---------------------------- |
| 1405 | Cyborg |
| 1391 | Escape from L.A. |
| 1192 | Mad Max Beyond Thunderdome |
| 1189 | Universal Soldier |
| 1222 | Soldier |
| 1362 | Johnny Mnemonic |
| 946 | Alive |
| 418 | Escape from New York |
| 1877 | The Last Starfighter |
| 1445 | The Rocketeer |
This method of movie suggestion is not perfect, but it’s an excellent start. We will see more dimensional queries in later chapters, such as two-dimensional geographic searches in MongoDB (see GeoSpatial Queries).
Today we jumped headlong into PostgreSQL’s flexibility in performing string searches and used the cube package for multidimensional searching. Most importantly, we caught a glimpse of the nonstandard extensions that put PostgreSQL at the top of the open source RDBMS field. There are dozens (if not hundreds) more extensions at your disposal, from geographic storage to cryptographic functions, custom datatypes, and language extensions. Beyond the core power of SQL, contrib packages make PostgreSQL shine.
Find
Find the online documentation listing all contributed packages bundled into Postgres. Read up on two that you could imagine yourself using in one of your projects.
Find the online POSIX regex documentation (it will also come in handy in future chapters).
Do
Create a stored procedure that enables you to input a movie title or an actor’s name and then receive the top five suggestions based on either movies the actor has starred in or films with similar genres.
Expand the movies database to track user comments and extract keywords (minus English stopwords). Cross-reference these keywords with actors’ last names and try to find the most talked-about actors.