Yesterday we saw how to define tables, populate them with data, update and delete rows, and perform basic reads. Today we’ll dig even deeper into the myriad ways that PostgreSQL can query data. We’ll see how to group similar values, execute code on the server, and create custom interfaces using views and rules. We’ll finish the day by using one of PostgreSQL’s contributed packages to flip tables on their heads.
An aggregate query groups results from several rows by some common criteria. It can be as simple as counting the number of rows in a table or calculating the average of some numerical column. They’re powerful SQL tools and also a lot of fun.
Let’s try some aggregate functions, but first we’ll need some more data in our database. Enter your own country into the countries table, your own city into the cities table, and your own address as a venue (which we just named My Place). Then add a few records to the events table.
Here’s a quick SQL tip: Rather than setting the venue_id explicitly, you can sub-SELECT it using a more human-readable title. If Moby is playing at the Crystal Ballroom, set the venue_id like this:
| INSERT INTO events (title, starts, ends, venue_id) |
| VALUES ('Moby', '2018-02-06 21:00', '2018-02-06 23:00', ( |
| SELECT venue_id |
| FROM venues |
| WHERE name = 'Crystal Ballroom' |
| ) |
| ); |
Populate your events table with the following data (to enter Valentine’s Day in PostgreSQL, you can escape the apostrophe with two, such as Heaven”s Gate):
| title | starts | ends | venue |
| -----------------+---------------------+---------------------+--------------- |
| Wedding | 2018-02-26 21:00:00 | 2018-02-26 23:00:00 | Voodoo Doughnut |
| Dinner with Mom | 2018-02-26 18:00:00 | 2018-02-26 20:30:00 | My Place |
| Valentine's Day | 2018-02-14 00:00:00 | 2018-02-14 23:59:00 | |
With our data set up, let’s try some aggregate queries. The simplest aggregate function is count, which is fairly self-explanatory. Counting all titles that contain the word Day (note: % is a wildcard on LIKE searches), you should receive a value of 3.
| SELECT count(title) |
| FROM events |
| WHERE title LIKE '%Day%'; |
To get the first start time and last end time of all events at the Crystal Ballroom, use min (return the smallest value) and max (return the largest value).
| SELECT min(starts), max(ends) |
| FROM events INNER JOIN venues |
| ON events.venue_id = venues.venue_id |
| WHERE venues.name = 'Crystal Ballroom'; |
| |
| min | max |
| ---------------------+--------------------- |
| 2018-02-06 21:00:00 | 2018-02-06 23:00:00 |
Aggregate functions are useful but limited on their own. If we wanted to count all events at each venue, we could write the following for each venue ID:
| SELECT count(*) FROM events WHERE venue_id = 1; |
| SELECT count(*) FROM events WHERE venue_id = 2; |
| SELECT count(*) FROM events WHERE venue_id = 3; |
| SELECT count(*) FROM events WHERE venue_id IS NULL; |
This would be tedious (intractable even) as the number of venues grows. This is where the GROUP BY command comes in handy.
GROUP BY is a shortcut for running the previous queries all at once. With GROUP BY, you tell Postgres to place the rows into groups and then perform some aggregate function (such as count) on those groups.
| SELECT venue_id, count(*) |
| FROM events |
| GROUP BY venue_id; |
| |
| venue_id | count |
| ----------+------- |
| 1 | 1 |
| 2 | 2 |
| 3 | 1 |
| 4 | 3 |
It’s a nice list, but can we filter by the count function? Absolutely. The GROUP BY condition has its own filter keyword: HAVING. HAVING is like the WHERE clause, except it can filter by aggregate functions (whereas WHERE cannot).
The following query SELECTs the most popular venues, those with two or more events:
| SELECT venue_id |
| FROM events |
| GROUP BY venue_id |
| HAVING count(*) >= 2 AND venue_id IS NOT NULL; |
| |
| venue_id | count |
| ----------+------- |
| 2 | 2 |
You can use GROUP BY without any aggregate functions. If you call SELECT... FROM...GROUP BY on one column, you get only unique values.
| SELECT venue_id FROM events GROUP BY venue_id; |
This kind of grouping is so common that SQL has a shortcut for it in the DISTINCT keyword.
| SELECT DISTINCT venue_id FROM events; |
The results of both queries will be identical.
If you’ve done any sort of production work with a relational database in the past, you are likely familiar with aggregate queries. They are a common SQL staple. Window functions, on the other hand, are not quite so common (PostgreSQL is one of the few open source databases to implement them).
Window functions are similar to GROUP BY queries in that they allow you to run aggregate functions across multiple rows. The difference is that they allow you to use built-in aggregate functions without requiring every single field to be grouped to a single row.
If we attempt to select the title column without grouping by it, we can expect an error.
| SELECT title, venue_id, count(*) |
| FROM events |
| GROUP BY venue_id; |
| |
| ERROR: column "events.title" must appear in the GROUP BY clause or |
| be used in an aggregate function |
We are counting up the rows by venue_id, and in the case of Fight Club and Wedding, we have two titles for a single venue_id. Postgres doesn’t know which title to display.
Whereas a GROUP BY clause will return one record per matching group value, a window function, which does not collapse the results per group, can return a separate record for each row. For a visual representation, see the following figure.
Let’s see an example of the sweet spot that window functions attempt to hit.
Window functions return all matches and replicate the results of any aggregate function.
| SELECT title, count(*) OVER (PARTITION BY venue_id) FROM events; |
| |
| title | count |
| -------------+------- |
| Moby | 1 |
| Fight Club | 1 |
| House Party | 3 |
| House Party | 3 |
| House Party | 3 |
| (5 rows) |
We like to think of PARTITION BY as akin to GROUP BY, but rather than grouping the results outside of the SELECT attribute list (and thus combining the results into fewer rows), it returns grouped values as any other field (calculating on the grouped variable but otherwise just another attribute). Or in SQL parlance, it returns the results of an aggregate function OVER a PARTITION of the result set.
Transactions are the bulwark of relational database consistency. All or nothing, that’s the transaction motto. Transactions ensure that every command of a set is executed. If anything fails along the way, all of the commands are rolled back as if they never happened.
PostgreSQL transactions follow ACID compliance, which stands for:
Atomic (either all operations succeed or none do)
Consistent (the data will always be in a good state and never in an inconsistent state)
Isolated (transactions don’t interfere with one another)
Durable (a committed transaction is safe, even after a server crash)
We should note here that consistency in ACID is different from consistency in CAP (covered in Appendix 2, The CAP Theorem).
We can wrap any transaction within a BEGIN TRANSACTION block. To verify atomicity, we’ll kill the transaction with the ROLLBACK command.
| BEGIN TRANSACTION; |
| DELETE FROM events; |
| ROLLBACK; |
| SELECT * FROM events; |
The events all remain. Transactions are useful when you’re modifying two tables that you don’t want out of sync. The classic example is a debit/credit system for a bank, where money is moved from one account to another:
| BEGIN TRANSACTION; |
| UPDATE account SET total=total+5000.0 WHERE account_id=1337; |
| UPDATE account SET total=total-5000.0 WHERE account_id=45887; |
| END; |
If something happened between the two updates, this bank just lost five grand. But when wrapped in a transaction block, the initial update is rolled back, even if the server explodes.
Every command we’ve seen until now has been declarative in the sense that we‘ve been able to get our desired result set using just SQL (which is quite powerful in itself). But sometimes the database doesn‘t give us everything we need natively and we need to run some code to fill in the gaps. At that point, though, you need to decide where the code is going to run. Should it run in Postgres or should it run on the application side?
If you decide you want the database to do the heavy lifting, Postgres offers stored procedures. Stored procedures are extremely powerful and can be used to do an enormous range of tasks, from performing complex mathematical operations that aren’t supported in SQL to triggering cascading series of events to pre-validating data before it‘s written to tables and far beyond. On the one hand, stored procedures can offer huge performance advantages. But the architectural costs can be high (and sometimes not worth it). You may avoid streaming thousands of rows to a client application, but you have also bound your application code to this database. And so the decision to use stored procedures should not be made lightly.
Caveats aside, let’s create a procedure (or FUNCTION) that simplifies INSERTing a new event at a venue without needing the venue_id. Here‘s what the procedure will accomplish: if the venue doesn’t exist, it will be created first and then referenced in the new event. The procedure will also return a Boolean indicating whether a new venue was added as a helpful bonus.
| CREATE OR REPLACE FUNCTION add_event( |
| title text, |
| starts timestamp, |
| ends timestamp, |
| venue text, |
| postal varchar(9), |
| country char(2)) |
| RETURNS boolean AS $$ |
| DECLARE |
| did_insert boolean := false; |
| found_count integer; |
| the_venue_id integer; |
| BEGIN |
| SELECT venue_id INTO the_venue_id |
| FROM venues v |
| WHERE v.postal_code=postal AND v.country_code=country AND v.name ILIKE venue |
| LIMIT 1; |
| |
| IF the_venue_id IS NULL THEN |
| INSERT INTO venues (name, postal_code, country_code) |
| VALUES (venue, postal, country) |
| RETURNING venue_id INTO the_venue_id; |
| |
| did_insert := true; |
| END IF; |
| |
| -- Note: this is a notice, not an error as in some programming languages |
| RAISE NOTICE 'Venue found %', the_venue_id; |
| |
| INSERT INTO events (title, starts, ends, venue_id) |
| VALUES (title, starts, ends, the_venue_id); |
| |
| RETURN did_insert; |
| END; |
| $$ LANGUAGE plpgsql; |
You can import this external file into the current schema using the following command-line argument (if you don’t feel like typing all that code).
| 7dbs=# i add_event.sql |
This stored procedure is run as a SELECT statement.
| SELECT add_event('House Party', '2018-05-03 23:00', |
| '2018-05-04 02:00', 'Run''s House', '97206', 'us'); |
Running it should return t (true) because this is the first use of the venue Run’s House. This saves a client two round-trip SQL commands to the database (a SELECT and then an INSERT) and instead performs only one.
The language we used in the procedure we wrote is PL/pgSQL (which stands for Procedural Language/PostgreSQL). Covering the details of an entire programming language is beyond the scope of this book, but you can read much more about it in the online PostgreSQL documentation.[6]
In addition to PL/pgSQL, Postgres supports three more core languages for writing procedures: Tcl (PL/Tcl), Perl (PL/Perl), and Python (PL/Python). People have written extensions for a dozen more, including Ruby, Java, PHP, Scheme, and others listed in the public documentation. Try this shell command:
| $ createlang 7dbs --list |
It will list the languages installed in your database. The createlang command is also used to add new languages, which you can find online.[7]
Triggers automatically fire stored procedures when some event happens, such as an insert or update. They allow the database to enforce some required behavior in response to changing data.
Let’s create a new PL/pgSQL function that logs whenever an event is updated (we want to be sure no one changes an event and tries to deny it later). First, create a logs table to store event changes. A primary key isn’t necessary here because it’s just a log.
| CREATE TABLE logs ( |
| event_id integer, |
| old_title varchar(255), |
| old_starts timestamp, |
| old_ends timestamp, |
| logged_at timestamp DEFAULT current_timestamp |
| ); |
Next, we build a function to insert old data into the log. The OLD variable represents the row about to be changed (NEW represents an incoming row, which we’ll see in action soon enough). Output a notice to the console with the event_id before returning.
| CREATE OR REPLACE FUNCTION log_event() RETURNS trigger AS $$ |
| DECLARE |
| BEGIN |
| INSERT INTO logs (event_id, old_title, old_starts, old_ends) |
| VALUES (OLD.event_id, OLD.title, OLD.starts, OLD.ends); |
| RAISE NOTICE 'Someone just changed event #%', OLD.event_id; |
| RETURN NEW; |
| END; |
| $$ LANGUAGE plpgsql; |
Finally, we create our trigger to log changes after any row is updated.
| CREATE TRIGGER log_events |
| AFTER UPDATE ON events |
| FOR EACH ROW EXECUTE PROCEDURE log_event(); |
So, it turns out our party at Run’s House has to end earlier than we hoped. Let’s change the event.
| UPDATE events |
| SET ends='2018-05-04 01:00:00' |
| WHERE title='House Party'; |
| |
| NOTICE: Someone just changed event #9 |
And the old end time was logged.
| SELECT event_id, old_title, old_ends, logged_at |
| FROM logs; |
| |
| event_id | old_title | old_ends | logged_at |
| ---------+-------------+---------------------+------------------------ |
| 9 | House Party | 2018-05-04 02:00:00 | 2017-02-26 15:50:31.939 |
Triggers can also be created before updates and before or after inserts.[8]
Wouldn’t it be great if you could use the results of a complex query just like any other table? Well, that’s exactly what VIEWs are for. Unlike stored procedures, these aren’t functions being executed but rather aliased queries. Let’s say that we wanted to see only holidays that contain the word Day and have no venue. We could create a VIEW for that like this:
| CREATE VIEW holidays AS |
| SELECT event_id AS holiday_id, title AS name, starts AS date |
| FROM events |
| WHERE title LIKE '%Day%' AND venue_id IS NULL; |
Creating a view really is as simple as writing a query and prefixing it with CREATE VIEW some_view_name AS. Now you can query holidays like any other table. Under the covers it’s the plain old events table. As proof, add Valentine’s Day on 2018-02-14 to events and query the holidays view.
| SELECT name, to_char(date, 'Month DD, YYYY') AS date |
| FROM holidays |
| WHERE date <= '2018-04-01'; |
| |
| name | date |
| ------------------+-------------------- |
| April Fools Day | April 01, 2018 |
| Valentine's Day | February 14, 2018 |
Views are powerful tools for opening up complex queried data in a simple way. The query may be a roiling sea of complexity underneath, but all you see is a table.
If you want to add a new column to the holidays view, it will have to come from the underlying table. Let’s alter the events table to have an array of associated colors.
| ALTER TABLE events |
| ADD colors text ARRAY; |
Because holidays are to have colors associated with them, let’s update the VIEW query to contain the colors array.
| CREATE OR REPLACE VIEW holidays AS |
| SELECT event_id AS holiday_id, title AS name, starts AS date, colors |
| FROM events |
| WHERE title LIKE '%Day%' AND venue_id IS NULL; |
Now it’s a matter of setting an array or color strings to the holiday of choice. Unfortunately, we cannot update a VIEW directly.
| UPDATE holidays SET colors = '{"red","green"}' where name = 'Christmas Day'; |
| |
| ERROR: cannot update a view |
| HINT: You need an unconditional ON UPDATE DO INSTEAD rule. |
Looks like we need a RULE instead of a view.
A RULE is a description of how to alter the parsed query tree. Every time Postgres runs an SQL statement, it parses the statement into a query tree (generally called an abstract syntax tree).
Operators and values become branches and leaves in the tree, and the tree is walked, pruned, and in other ways edited before execution. This tree is optionally rewritten by Postgres rules, before being sent on to the query planner (which also rewrites the tree to run optimally), and sends this final command to be executed. See how SQL gets executed in PostgreSQL in the figure.
In fact, a VIEW such as holidays is a RULE. We can prove this by taking a look at the execution plan of the holidays view using the EXPLAIN command (notice Filter is the WHERE clause, and Output is the column list).
| EXPLAIN VERBOSE |
| SELECT * |
| FROM holidays; |
| |
| QUERY PLAN |
| ----------------------------------------------------------------------------- |
| Seq Scan on public.events (cost=0.00..1.01 rows=1 width=44) |
| Output: events.event_id, events.title, events.starts, events.colors |
| Filter: ((events.venue_id IS NULL) AND |
| ((events.title)::text ~~ '%Day%'::text)) |
Compare that to running EXPLAIN VERBOSE on the query from which we built the holidays VIEW. They’re functionally identical.
| EXPLAIN VERBOSE |
| SELECT event_id AS holiday_id, |
| title AS name, starts AS date, colors |
| FROM events |
| WHERE title LIKE '%Day%' AND venue_id IS NULL; |
| |
| QUERY PLAN |
| ----------------------------------------------------------------------------- |
| Seq Scan on public.events (cost=0.00..1.04 rows=1 width=57) |
| Output: event_id, title, starts, colors |
| Filter: ((events.venue_id IS NULL) AND |
| ((events.title)::text ~~ '%Day%'::text)) |
So, to allow updates against our holidays view, we need to craft a RULE that tells Postgres what to do with an UPDATE. Our rule will capture updates to the holidays view and instead run the update on events, pulling values from the pseudorelations NEW and OLD. NEW functionally acts as the relation containing the values we’re setting, while OLD contains the values we query by.
| CREATE RULE update_holidays AS ON UPDATE TO holidays DO INSTEAD |
| UPDATE events |
| SET title = NEW.name, |
| starts = NEW.date, |
| colors = NEW.colors |
| WHERE title = OLD.name; |
With this rule in place, now we can update holidays directly.
| UPDATE holidays SET colors = '{"red","green"}' where name = 'Christmas Day'; |
Next, let’s insert New Years Day on 2013-01-01 into holidays. As expected, we need a rule for that too. No problem.
| CREATE RULE insert_holidays AS ON INSERT TO holidays DO INSTEAD |
| INSERT INTO ... |
We’re going to move on from here, but if you’d like to play more with RULEs, try to add a DELETE RULE.
For our last exercise of the day, we’re going to build a monthly calendar of events, where each month in the calendar year counts the number of events in that month. This kind of operation is commonly done by a pivot table. These constructs “pivot” grouped data around some other output, in our case a list of months. We’ll build our pivot table using the crosstab function.
Start by crafting a query to count the number of events per month each year. PostgreSQL provides an extract function that returns some subfield from a date or timestamp, which aids in our grouping.
| SELECT extract(year from starts) as year, |
| extract(month from starts) as month, count(*) |
| FROM events |
| GROUP BY year, month |
| ORDER BY year, month; |
To use crosstab, the query must return three columns: rowid, category, and value. We’ll be using the year as an ID, which means the other fields are category (the month) and value (the count).
The crosstab function needs another set of values to represent months. This is how the function knows how many columns we need. These are the values that become the columns (the table to pivot against). So let’s create a table to store a list of numbers. Because we‘ll only need the table for a few operations, we‘ll create an ephemeral table, which lasts only as long as the current Postgres session, using the CREATE TEMPORARY TABLE command.
| CREATE TEMPORARY TABLE month_count(month INT); |
| INSERT INTO month_count VALUES (1),(2),(3),(4),(5), |
| (6),(7),(8),(9),(10),(11),(12); |
Now we’re ready to call crosstab with our two queries.
| SELECT * FROM crosstab( |
| 'SELECT extract(year from starts) as year, |
| extract(month from starts) as month, count(*) |
| FROM events |
| GROUP BY year, month |
| ORDER BY year, month', |
| 'SELECT * FROM month_count' |
| ); |
| |
| ERROR: a column definition list is required for functions returning "record" |
Oops. An error occurred. This cryptic error is basically saying that the function is returning a set of records (rows) but it doesn’t know how to label them. In fact, it doesn’t even know what datatypes they are.
Remember, the pivot table is using our months as categories, but those months are just integers. So, we define them like this:
| SELECT * FROM crosstab( |
| 'SELECT extract(year from starts) as year, |
| extract(month from starts) as month, count(*) |
| FROM events |
| GROUP BY year, month |
| ORDER BY year, month', |
| 'SELECT * FROM month_count' |
| ) AS ( |
| year int, |
| jan int, feb int, mar int, apr int, may int, jun int, |
| jul int, aug int, sep int, oct int, nov int, dec int |
| ) ORDER BY YEAR; |
We have one column year (which is the row ID) and twelve more columns representing the months.
| year | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
| -----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----- |
| 2018 | | 5 | | 1 | 1 | | | | | | | 1 |
Go ahead and add a couple more events on another year just to see next year’s event counts. Run the crosstab function again, and enjoy the calendar.
Today finalized the basics of PostgreSQL. What we’re starting to see is that Postgres is more than just a server for storing vanilla datatypes and querying them. Instead, it’s a powerful data management engine that can reformat output data, store weird datatypes such as arrays, execute logic, and provide enough power to rewrite incoming queries.
Find
Do
Create a rule that captures DELETEs on venues and instead sets the active flag (created in the Day 1 homework) to FALSE.
A temporary table was not the best way to implement our event calendar pivot table. The generate_series(a, b) function returns a set of records, from a to b. Replace the month_count table SELECT with this.
Build a pivot table that displays every day in a single month, where each week of the month is a row and each day name forms a column across the top (seven days, starting with Sunday and ending with Saturday) like a standard month calendar. Each day should contain a count of the number of events for that date or should remain blank if no event occurs.