Day 2: Advanced Queries, Code, and Rules

Yesterday we saw how to define tables, populate them with data, update and delete rows, and perform basic reads. Today we’ll dig even deeper into the myriad ways that PostgreSQL can query data. We’ll see how to group similar values, execute code on the server, and create custom interfaces using views and rules. We’ll finish the day by using one of PostgreSQL’s contributed packages to flip tables on their heads.

Aggregate Functions

An aggregate query groups results from several rows by some common criteria. It can be as simple as counting the number of rows in a table or calculating the average of some numerical column. They’re powerful SQL tools and also a lot of fun.

Let’s try some aggregate functions, but first we’ll need some more data in our database. Enter your own country into the countries table, your own city into the cities table, and your own address as a venue (which we just named My Place). Then add a few records to the events table.
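
If it's been a while since Day 1, the inserts might look something like the following sketch. The country, city, and address values here are only placeholders (assuming the countries, cities, and venues tables still have the columns we created on Day 1); substitute your own.

 INSERT INTO countries (country_code, country_name)
 VALUES ('de', 'Germany');

 INSERT INTO cities (name, postal_code, country_code)
 VALUES ('Berlin', '10117', 'de');

 -- venue_id is a SERIAL column, so Postgres assigns it for us
 INSERT INTO venues (name, street_address, postal_code, country_code)
 VALUES ('My Place', 'Unter den Linden 1', '10117', 'de');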

Here’s a quick SQL tip: Rather than setting the venue_id explicitly, you can sub-SELECT it using a more human-readable title. If Moby is playing at the Crystal Ballroom, set the venue_id like this:

 INSERT INTO events (title, starts, ends, venue_id)
 VALUES ('Moby', '2018-02-06 21:00', '2018-02-06 23:00', (
   SELECT venue_id
   FROM venues
   WHERE name = 'Crystal Ballroom'
   )
 );

Populate your events table with the following data (to enter Valentine's Day in PostgreSQL, you can escape the apostrophe by doubling it, as in Heaven''s Gate):

        title       |       starts        |        ends         |      venue
  ------------------+---------------------+---------------------+-----------------
   Wedding          | 2018-02-26 21:00:00 | 2018-02-26 23:00:00 | Voodoo Doughnut
   Dinner with Mom  | 2018-02-26 18:00:00 | 2018-02-26 20:30:00 | My Place
   Valentine's Day  | 2018-02-14 00:00:00 | 2018-02-14 23:59:00 |
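
If you'd rather not look up venue IDs by hand, here's one sketch of those INSERTs that reuses the sub-SELECT trick from above (it assumes the venue names match rows already in venues; Valentine's Day simply gets no venue):

 INSERT INTO events (title, starts, ends, venue_id) VALUES
   ('Wedding', '2018-02-26 21:00', '2018-02-26 23:00',
     (SELECT venue_id FROM venues WHERE name = 'Voodoo Doughnut')),
   ('Dinner with Mom', '2018-02-26 18:00', '2018-02-26 20:30',
     (SELECT venue_id FROM venues WHERE name = 'My Place')),
   ('Valentine''s Day', '2018-02-14 00:00', '2018-02-14 23:59', NULL);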

With our data set up, let’s try some aggregate queries. The simplest aggregate function is count, which is fairly self-explanatory. Counting all titles that contain the word Day (note: % is a wildcard on LIKE searches), you should receive a value of 3.

 SELECT count(title)
 FROM events
 WHERE title LIKE '%Day%';

To get the first start time and last end time of all events at the Crystal Ballroom, use min (return the smallest value) and max (return the largest value).

 SELECT min(starts), max(ends)
 FROM events INNER JOIN venues
   ON events.venue_id = venues.venue_id
 WHERE venues.name = 'Crystal Ballroom';

          min          |         max
  ---------------------+---------------------
   2018-02-06 21:00:00 | 2018-02-06 23:00:00

Aggregate functions are useful but limited on their own. If we wanted to count all events at each venue, we could write the following for each venue ID:

 SELECT count(*) FROM events WHERE venue_id = 1;
 SELECT count(*) FROM events WHERE venue_id = 2;
 SELECT count(*) FROM events WHERE venue_id = 3;
 SELECT count(*) FROM events WHERE venue_id IS NULL;

This would be tedious (intractable even) as the number of venues grows. This is where the GROUP BY command comes in handy.

Grouping

GROUP BY is a shortcut for running the previous queries all at once. With GROUP BY, you tell Postgres to place the rows into groups and then perform some aggregate function (such as count) on those groups.

 SELECT venue_id, count(*)
 FROM events
 GROUP BY venue_id;

  venue_id | count
 ----------+-------
         1 |     1
         2 |     2
         3 |     1
         4 |     3

It’s a nice list, but can we filter by the count function? Absolutely. The GROUP BY condition has its own filter keyword: HAVING. HAVING is like the WHERE clause, except it can filter by aggregate functions (whereas WHERE cannot).

The following query SELECTs the most popular venues, those with two or more events:

 SELECT venue_id, count(*)
 FROM events
 GROUP BY venue_id
 HAVING count(*) >= 2 AND venue_id IS NOT NULL;

  venue_id | count
 ----------+-------
         2 |     2

You can use GROUP BY without any aggregate functions. If you call SELECT...FROM...GROUP BY on one column, you get only unique values.

 SELECT venue_id FROM events GROUP BY venue_id;

This kind of grouping is so common that SQL has a shortcut for it in the DISTINCT keyword.

 SELECT DISTINCT venue_id FROM events;

The results of both queries will be identical.

Window Functions

If you’ve done any sort of production work with a relational database in the past, you are likely familiar with aggregate queries. They are a common SQL staple. Window functions, on the other hand, are not quite so common (PostgreSQL is one of the few open source databases to implement them).

Window functions are similar to GROUP BY queries in that they allow you to run aggregate functions across multiple rows. The difference is that they allow you to use built-in aggregate functions without requiring every single field to be grouped to a single row.

If we attempt to select the title column without grouping by it, we can expect an error.

 SELECT title, venue_id, count(*)
 FROM events
 GROUP BY venue_id;

 ERROR:  column "events.title" must appear in the GROUP BY clause or
         be used in an aggregate function

We are counting up the rows by venue_id, and in the case of Fight Club and Wedding, we have two titles for a single venue_id. Postgres doesn’t know which title to display.

Whereas a GROUP BY clause will return one record per matching group value, a window function, which does not collapse the results per group, can return a separate record for each row. For a visual representation, see the following figure.

[Figure: images/postgres-window-func.png: window function results compared with a GROUP BY query]

Let’s see an example of the sweet spot that window functions attempt to hit.

Window functions return all matches and replicate the results of any aggregate function.

 SELECT title, count(*) OVER (PARTITION BY venue_id) FROM events;

     title     | count
 --------------+-------
  Moby         |     1
  Fight Club   |     1
  House Party  |     3
  House Party  |     3
  House Party  |     3
 (5 rows)

We like to think of PARTITION BY as akin to GROUP BY, but rather than grouping the results outside of the SELECT attribute list (and thus combining the results into fewer rows), it returns grouped values as any other field (calculating on the grouped variable but otherwise just another attribute). Or in SQL parlance, it returns the results of an aggregate function OVER a PARTITION of the result set.
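
As a small variation on that query (just a sketch), you can keep venue_id in the output and alias the windowed count, which makes it easier to see that every event row survives while the aggregate is computed per venue:

 SELECT title, venue_id,
   count(*) OVER (PARTITION BY venue_id) AS events_at_venue
 FROM events
 ORDER BY venue_id;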

Transactions

Transactions are the bulwark of relational database consistency. All or nothing, that’s the transaction motto. Transactions ensure that every command of a set is executed. If anything fails along the way, all of the commands are rolled back as if they never happened.

PostgreSQL transactions follow ACID compliance, which stands for:

  • Atomic (either all operations succeed or none do)

  • Consistent (the data will always be in a good state and never in an inconsistent state)

  • Isolated (transactions don’t interfere with one another)

  • Durable (a committed transaction is safe, even after a server crash)

We should note here that consistency in ACID is different from consistency in CAP (covered in Appendix 2, The CAP Theorem).

We can wrap any transaction within a BEGIN TRANSACTION block. To verify atomicity, we’ll kill the transaction with the ROLLBACK command.

 BEGIN TRANSACTION;
 DELETE FROM events;
 ROLLBACK;
 SELECT * FROM events;

The events all remain. Transactions are useful when you’re modifying two tables that you don’t want out of sync. The classic example is a debit/credit system for a bank, where money is moved from one account to another:

 BEGIN TRANSACTION;
 UPDATE account SET total=total+5000.0 WHERE account_id=1337;
 UPDATE account SET total=total-5000.0 WHERE account_id=45887;
 END;

If something happened between the two updates, this bank just lost five grand. But when wrapped in a transaction block, the initial update is rolled back, even if the server explodes.

Stored Procedures

Every command we've seen until now has been declarative in the sense that we've been able to get our desired result set using just SQL (which is quite powerful in itself). But sometimes the database doesn't give us everything we need natively and we need to run some code to fill in the gaps. At that point, though, you need to decide where the code is going to run. Should it run in Postgres or should it run on the application side?

If you decide you want the database to do the heavy lifting, Postgres offers stored procedures. Stored procedures are extremely powerful and can be used to do an enormous range of tasks, from performing complex mathematical operations that aren't supported in SQL to triggering cascading series of events to pre-validating data before it's written to tables and far beyond. On the one hand, stored procedures can offer huge performance advantages. But the architectural costs can be high (and sometimes not worth it). You may avoid streaming thousands of rows to a client application, but you have also bound your application code to this database. And so the decision to use stored procedures should not be made lightly.

Caveats aside, let's create a procedure (or FUNCTION) that simplifies INSERTing a new event at a venue without needing the venue_id. Here's what the procedure will accomplish: if the venue doesn't exist, it will be created first and then referenced in the new event. As a helpful bonus, the procedure will also return a Boolean indicating whether a new venue was added.

 CREATE OR REPLACE FUNCTION add_event(
   title text,
   starts timestamp,
   ends timestamp,
   venue text,
   postal varchar(9),
   country char(2))
 RETURNS boolean AS $$
 DECLARE
   did_insert boolean := false;
   found_count integer;
   the_venue_id integer;
 BEGIN
   SELECT venue_id INTO the_venue_id
   FROM venues v
   WHERE v.postal_code=postal AND v.country_code=country AND v.name ILIKE venue
   LIMIT 1;

   IF the_venue_id IS NULL THEN
     INSERT INTO venues (name, postal_code, country_code)
     VALUES (venue, postal, country)
     RETURNING venue_id INTO the_venue_id;

     did_insert := true;
   END IF;

   -- Note: this is a notice, not an error as in some programming languages
   RAISE NOTICE 'Venue found %', the_venue_id;

   INSERT INTO events (title, starts, ends, venue_id)
   VALUES (title, starts, ends, the_venue_id);

   RETURN did_insert;
 END;
 $$ LANGUAGE plpgsql;

You can import this external file into the current schema using the following psql command (if you don't feel like typing all that code).

 7dbs=# \i add_event.sql

This stored procedure is run as a SELECT statement.

 SELECT add_event('House Party', '2018-05-03 23:00',
   '2018-05-04 02:00', 'Run''s House', '97206', 'us');

Running it should return t (true) because this is the first use of the venue Run’s House. This saves a client two round-trip SQL commands to the database (a SELECT and then an INSERT) and instead performs only one.
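
Because the venue now exists, calling the function a second time for the same place should take the other branch and return f. For example (the event title here is made up):

 SELECT add_event('After Party', '2018-05-04 02:00',
   '2018-05-04 06:00', 'Run''s House', '97206', 'us');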

The language we used in the procedure we wrote is PL/pgSQL (which stands for Procedural Language/PostgreSQL). Covering the details of an entire programming language is beyond the scope of this book, but you can read much more about it in the online PostgreSQL documentation.[6]

In addition to PL/pgSQL, Postgres supports three more core languages for writing procedures: Tcl (PL/Tcl), Perl (PL/Perl), and Python (PL/Python). People have written extensions for a dozen more, including Ruby, Java, PHP, Scheme, and others listed in the public documentation. Try this shell command:

 $ createlang 7dbs --list

It will list the languages installed in your database. The createlang command is also used to add new languages, which you can find online.[7]
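
Note that createlang was removed in PostgreSQL 10 (procedural languages are now installed with CREATE EXTENSION). If your version no longer ships it, a quick way to see which languages are installed is to query the system catalog directly:

 SELECT lanname FROM pg_language;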

Pull the Triggers

Triggers automatically fire stored procedures when some event happens, such as an insert or update. They allow the database to enforce some required behavior in response to changing data.

Let’s create a new PL/pgSQL function that logs whenever an event is updated (we want to be sure no one changes an event and tries to deny it later). First, create a logs table to store event changes. A primary key isn’t necessary here because it’s just a log.

 CREATE TABLE logs (
   event_id integer,
   old_title varchar(255),
   old_starts timestamp,
   old_ends timestamp,
   logged_at timestamp DEFAULT current_timestamp
 );

Next, we build a function to insert old data into the log. The OLD variable represents the row about to be changed (NEW represents an incoming row, which we’ll see in action soon enough). Output a notice to the console with the event_id before returning.

 CREATE OR REPLACE FUNCTION log_event() RETURNS trigger AS $$
 DECLARE
 BEGIN
   INSERT INTO logs (event_id, old_title, old_starts, old_ends)
   VALUES (OLD.event_id, OLD.title, OLD.starts, OLD.ends);
   RAISE NOTICE 'Someone just changed event #%', OLD.event_id;
   RETURN NEW;
 END;
 $$ LANGUAGE plpgsql;

Finally, we create our trigger to log changes after any row is updated.

 CREATE TRIGGER log_events
 AFTER UPDATE ON events
 FOR EACH ROW EXECUTE PROCEDURE log_event();

So, it turns out our party at Run’s House has to end earlier than we hoped. Let’s change the event.

 UPDATE events
 SET ends='2018-05-04 01:00:00'
 WHERE title='House Party';

 NOTICE:  Someone just changed event #9

And the old end time was logged.

 SELECT event_id, old_title, old_ends, logged_at
 FROM logs;

  event_id |  old_title  |      old_ends       |        logged_at
 ----------+-------------+---------------------+-------------------------
         9 | House Party | 2018-05-04 02:00:00 | 2017-02-26 15:50:31.939

Triggers can also be created before updates and before or after inserts.[8]
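
A BEFORE INSERT trigger, for instance, is a handy place to validate rows before they're written. Here's a minimal sketch (the function and trigger names are made up for illustration) that rejects any event that ends before it starts:

 CREATE OR REPLACE FUNCTION check_event_times() RETURNS trigger AS $$
 BEGIN
   IF NEW.ends < NEW.starts THEN
     RAISE EXCEPTION 'Event "%" cannot end before it starts', NEW.title;
   END IF;
   RETURN NEW;  -- returning NEW lets the insert proceed unchanged
 END;
 $$ LANGUAGE plpgsql;

 CREATE TRIGGER validate_event_times
 BEFORE INSERT ON events
 FOR EACH ROW EXECUTE PROCEDURE check_event_times();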

Viewing the World

Wouldn’t it be great if you could use the results of a complex query just like any other table? Well, that’s exactly what VIEWs are for. Unlike stored procedures, these aren’t functions being executed but rather aliased queries. Let’s say that we wanted to see only holidays that contain the word Day and have no venue. We could create a VIEW for that like this:

 CREATE VIEW holidays AS
 SELECT event_id AS holiday_id, title AS name, starts AS date
 FROM events
 WHERE title LIKE '%Day%' AND venue_id IS NULL;

Creating a view really is as simple as writing a query and prefixing it with CREATE VIEW some_view_name AS. Now you can query holidays like any other table. Under the covers it’s the plain old events table. As proof, add Valentine’s Day on 2018-02-14 to events and query the holidays view.

 SELECT name, to_char(date, 'Month DD, YYYY') AS date
 FROM holidays
 WHERE date <= '2018-04-01';

        name       |        date
 ------------------+--------------------
  April Fools Day  | April 01, 2018
  Valentine's Day  | February 14, 2018

Views are powerful tools for opening up complex queried data in a simple way. The query may be a roiling sea of complexity underneath, but all you see is a table.

If you want to add a new column to the holidays view, it will have to come from the underlying table. Let’s alter the events table to have an array of associated colors.

 ALTER TABLE events
 ADD colors text ARRAY;

Because holidays are to have colors associated with them, let’s update the VIEW query to contain the colors array.

 CREATE OR REPLACE VIEW holidays AS
 SELECT event_id AS holiday_id, title AS name, starts AS date, colors
 FROM events
 WHERE title LIKE '%Day%' AND venue_id IS NULL;

Now it's a matter of setting an array of color strings on the holiday of choice. Unfortunately, we cannot update a VIEW directly.

 UPDATE holidays SET colors = '{"red","green"}' WHERE name = 'Christmas Day';

 ERROR:  cannot update a view
 HINT:  You need an unconditional ON UPDATE DO INSTEAD rule.

Looks like we need a RULE.

What RULEs the School?

A RULE is a description of how to alter the parsed query tree. Every time Postgres runs an SQL statement, it parses the statement into a query tree (generally called an abstract syntax tree).

Operators and values become branches and leaves in the tree, and the tree is walked, pruned, and in other ways edited before execution. This tree is optionally rewritten by Postgres rules before being sent on to the query planner (which also rewrites the tree to run optimally), and the final command is then executed. You can see how SQL gets executed in PostgreSQL in the following figure.

[Figure: images/postgres-overview.png: how SQL gets executed in PostgreSQL]

In fact, a VIEW such as holidays is a RULE. We can prove this by taking a look at the execution plan of the holidays view using the EXPLAIN command (notice Filter is the WHERE clause, and Output is the column list).

 EXPLAIN VERBOSE
 SELECT *
 FROM holidays;

                                  QUERY PLAN
 -----------------------------------------------------------------------------
  Seq Scan on public.events  (cost=0.00..1.01 rows=1 width=44)
    Output: events.event_id, events.title, events.starts, events.colors
    Filter: ((events.venue_id IS NULL) AND
             ((events.title)::text ~~ '%Day%'::text))

Compare that to running EXPLAIN VERBOSE on the query from which we built the holidays  VIEW. They’re functionally identical.

 EXPLAIN VERBOSE
 SELECT event_id AS holiday_id,
   title AS name, starts AS date, colors
 FROM events
 WHERE title LIKE '%Day%' AND venue_id IS NULL;

                                  QUERY PLAN
 -----------------------------------------------------------------------------
  Seq Scan on public.events  (cost=0.00..1.04 rows=1 width=57)
    Output: event_id, title, starts, colors
    Filter: ((events.venue_id IS NULL) AND
             ((events.title)::text ~~ '%Day%'::text))

So, to allow updates against our holidays view, we need to craft a RULE that tells Postgres what to do with an UPDATE. Our rule will capture updates to the holidays view and instead run the update on events, pulling values from the pseudorelations NEW and OLD. NEW functionally acts as the relation containing the values we’re setting, while OLD contains the values we query by.

 CREATE RULE update_holidays AS ON UPDATE TO holidays DO INSTEAD
   UPDATE events
   SET title = NEW.name,
       starts = NEW.date,
       colors = NEW.colors
   WHERE title = OLD.name;

With this rule in place, now we can update holidays directly.

 UPDATE holidays SET colors = '{"red","green"}' WHERE name = 'Christmas Day';

Next, let's insert New Year's Day on 2018-01-01 into holidays. As expected, we need a rule for that too. No problem.

 CREATE RULE insert_holidays AS ON INSERT TO holidays DO INSTEAD
 INSERT INTO ...
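
One way to finish that rule, as a minimal sketch, is to map the view's columns back onto events and let event_id be generated automatically (this assumes we don't need to set ends through the view):

 CREATE RULE insert_holidays AS ON INSERT TO holidays DO INSTEAD
   INSERT INTO events (title, starts, colors)
   VALUES (NEW.name, NEW.date, NEW.colors);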

We’re going to move on from here, but if you’d like to play more with RULEs, try to add a DELETE RULE.

I’ll Meet You at the Crosstab

For our last exercise of the day, we’re going to build a monthly calendar of events, where each month in the calendar year counts the number of events in that month. This kind of operation is commonly done by a pivot table. These constructs “pivot” grouped data around some other output, in our case a list of months. We’ll build our pivot table using the crosstab function.

Start by crafting a query to count the number of events per month each year. PostgreSQL provides an extract function that returns some subfield from a date or timestamp, which aids in our grouping.

 SELECT extract(year from starts) as year,
   extract(month from starts) as month, count(*)
 FROM events
 GROUP BY year, month
 ORDER BY year, month;

To use crosstab, the query must return three columns: rowid, category, and value. We’ll be using the year as an ID, which means the other fields are category (the month) and value (the count).

The crosstab function needs another set of values to represent months. This is how the function knows how many columns we need. These are the values that become the columns (the table to pivot against). So let's create a table to store a list of numbers. Because we'll only need the table for a few operations, we'll create an ephemeral table, which lasts only as long as the current Postgres session, using the CREATE TEMPORARY TABLE command.

 CREATE TEMPORARY TABLE month_count(month INT);
 INSERT INTO month_count VALUES (1),(2),(3),(4),(5),
   (6),(7),(8),(9),(10),(11),(12);

Now we’re ready to call crosstab with our two queries.

 SELECT * FROM crosstab(
   'SELECT extract(year from starts) as year,
      extract(month from starts) as month, count(*)
    FROM events
    GROUP BY year, month
    ORDER BY year, month',
   'SELECT * FROM month_count'
 );

 ERROR:  a column definition list is required for functions returning "record"

Oops. An error occurred. This cryptic error is basically saying that the function is returning a set of records (rows) but it doesn’t know how to label them. In fact, it doesn’t even know what datatypes they are.

Remember, the pivot table is using our months as categories, but those months are just integers. So, we define them like this:

 SELECT * FROM crosstab(
   'SELECT extract(year from starts) as year,
      extract(month from starts) as month, count(*)
    FROM events
    GROUP BY year, month
    ORDER BY year, month',
   'SELECT * FROM month_count'
 ) AS (
   year int,
   jan int, feb int, mar int, apr int, may int, jun int,
   jul int, aug int, sep int, oct int, nov int, dec int
 ) ORDER BY YEAR;

We have one column year (which is the row ID) and twelve more columns representing the months.

  year | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec
 ------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
  2018 |     |   5 |     |   1 |   1 |     |     |     |     |     |     |   1

Go ahead and add a couple more events on another year just to see next year’s event counts. Run the crosstab function again, and enjoy the calendar.
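
For example (the titles and dates here are arbitrary), a couple of rows like these will give the pivot a second year to display:

 INSERT INTO events (title, starts, ends) VALUES
   ('Conference', '2019-03-01 08:00', '2019-03-01 18:00'),
   ('Spring Cleaning', '2019-04-06 09:00', '2019-04-06 17:00');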

Day 2 Wrap-Up

Today finalized the basics of PostgreSQL. What we’re starting to see is that Postgres is more than just a server for storing vanilla datatypes and querying them. Instead, it’s a powerful data management engine that can reformat output data, store weird datatypes such as arrays, execute logic, and provide enough power to rewrite incoming queries.

Day 2 Homework

Find

  1. Find the list of aggregate functions in the PostgreSQL docs.
  2. Find a GUI program to interact with PostgreSQL, such as pgAdmin, Datagrip, or Navicat.

Do

  1. Create a rule that captures DELETEs on venues and instead sets the active flag (created in the Day 1 homework) to FALSE.

  2. A temporary table was not the best way to implement our event calendar pivot table. The generate_series(a, b) function returns a set of records, from a to b. Replace the month_count table SELECT with this.

  3. Build a pivot table that displays every day in a single month, where each week of the month is a row and each day name forms a column across the top (seven days, starting with Sunday and ending with Saturday) like a standard month calendar. Each day should contain a count of the number of events for that date or should remain blank if no event occurs.
