Day 2: Advanced Queries, Code, and Rules

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Day 2: Advanced Queries, Code, and Rules

Yesterday we saw how to define tables, populate them with data, update and delete rows, and perform basic reads. Today we’ll dig even deeper into the myriad ways that PostgreSQL can query data. We’ll see how to group similar values, execute code on the server, and create custom interfaces using views and rules. We’ll finish the day by using one of PostgreSQL’s contributed packages to flip tables on their heads.

Aggregate Functions

An aggregate query groups results from several rows by some common criteria. It can be as simple as counting the number of rows in a table or calculating the average of some numerical column. They’re powerful SQL tools and also a lot of fun.

Let’s try some aggregate functions, but first we’ll need some more data in our database. Enter your own country into the countries table, your own city into the cities table, and your own address as a venue (which we just named My Place). Then add a few records to the events table.

Here’s a quick SQL tip: Rather than setting the venue_id explicitly, you can sub-SELECT it using a more human-readable title. If Moby is playing at the Crystal Ballroom, set the venue_id like this:

	INSERT INTO events (title, starts, ends, venue_id)
	VALUES ('Moby', '2018-02-06 21:00', '2018-02-06 23:00', (
	SELECT venue_id
	FROM venues
	WHERE name = 'Crystal Ballroom'
	)
	);

Populate your events table with the following data (to enter Valentine’s Day in PostgreSQL, you can escape the apostrophe with two, such as Heaven”s Gate):

	title \| starts \| ends \| venue
	-----------------+---------------------+---------------------+---------------
	Wedding \| 2018-02-26 21:00:00 \| 2018-02-26 23:00:00 \| Voodoo Doughnut
	Dinner with Mom \| 2018-02-26 18:00:00 \| 2018-02-26 20:30:00 \| My Place
	Valentine's Day \| 2018-02-14 00:00:00 \| 2018-02-14 23:59:00 \|

With our data set up, let’s try some aggregate queries. The simplest aggregate function is count, which is fairly self-explanatory. Counting all titles that contain the word Day (note: % is a wildcard on LIKE searches), you should receive a value of 3.

	SELECT count(title)
	FROM events
	WHERE title LIKE '%Day%';

To get the first start time and last end time of all events at the Crystal Ballroom, use min (return the smallest value) and max (return the largest value).

	SELECT min(starts), max(ends)
	FROM events INNER JOIN venues
	ON events.venue_id = venues.venue_id
	WHERE venues.name = 'Crystal Ballroom';

	min \| max
	---------------------+---------------------
	2018-02-06 21:00:00 \| 2018-02-06 23:00:00

Aggregate functions are useful but limited on their own. If we wanted to count all events at each venue, we could write the following for each venue ID:

	SELECT count() FROM* events WHERE venue_id = 1;
	SELECT count() FROM* events WHERE venue_id = 2;
	SELECT count() FROM* events WHERE venue_id = 3;
	SELECT count() FROM* events WHERE venue_id IS NULL;

This would be tedious (intractable even) as the number of venues grows. This is where the GROUP BY command comes in handy.

Grouping

GROUP BY is a shortcut for running the previous queries all at once. With GROUP BY, you tell Postgres to place the rows into groups and then perform some aggregate function (such as count) on those groups.

	SELECT venue_id, count(*)
	FROM events
	GROUP BY venue_id;

	venue_id \| count
	----------+-------
	1 \| 1
	2 \| 2
	3 \| 1
	4 \| 3

It’s a nice list, but can we filter by the count function? Absolutely. The GROUP BY condition has its own filter keyword: HAVING. HAVING is like the WHERE clause, except it can filter by aggregate functions (whereas WHERE cannot).

The following query SELECTs the most popular venues, those with two or more events:

	SELECT venue_id
	FROM events
	GROUP BY venue_id
	HAVING count() >= 2 AND* venue_id IS NOT NULL;

	venue_id \| count
	----------+-------
	2 \| 2

You can use GROUP BY without any aggregate functions. If you call SELECT... FROM...GROUP BY on one column, you get only unique values.

SELECT venue_id FROM events GROUP BY venue_id;

This kind of grouping is so common that SQL has a shortcut for it in the DISTINCT keyword.

SELECT DISTINCT venue_id FROM events;

The results of both queries will be identical.

Window Functions

If you’ve done any sort of production work with a relational database in the past, you are likely familiar with aggregate queries. They are a common SQL staple. Window functions, on the other hand, are not quite so common (PostgreSQL is one of the few open source databases to implement them).

Window functions are similar to GROUP BY queries in that they allow you to run aggregate functions across multiple rows. The difference is that they allow you to use built-in aggregate functions without requiring every single field to be grouped to a single row.

If we attempt to select the title column without grouping by it, we can expect an error.

	SELECT title, venue_id, count(*)
	FROM events
	GROUP BY venue_id;

	ERROR: column "events.title" must appear in the GROUP BY clause or
	be used in an aggregate function

We are counting up the rows by venue_id, and in the case of Fight Club and Wedding, we have two titles for a single venue_id. Postgres doesn’t know which title to display.

Whereas a GROUP BY clause will return one record per matching group value, a window function, which does not collapse the results per group, can return a separate record for each row. For a visual representation, see the following figure.

Let’s see an example of the sweet spot that window functions attempt to hit.

Window functions return all matches and replicate the results of any aggregate function.

	SELECT title, count() OVER (PARTITION* BY venue_id) FROM events;

	title \| count
	-------------+-------
	Moby \| 1
	Fight Club \| 1
	House Party \| 3
	House Party \| 3
	House Party \| 3
	(5 rows)

We like to think of PARTITION BY as akin to GROUP BY, but rather than grouping the results outside of the SELECT attribute list (and thus combining the results into fewer rows), it returns grouped values as any other field (calculating on the grouped variable but otherwise just another attribute). Or in SQL parlance, it returns the results of an aggregate function OVER a PARTITION of the result set.

Transactions

Transactions are the bulwark of relational database consistency. All or nothing, that’s the transaction motto. Transactions ensure that every command of a set is executed. If anything fails along the way, all of the commands are rolled back as if they never happened.

PostgreSQL transactions follow ACID compliance, which stands for:

Atomic (either all operations succeed or none do)
Consistent (the data will always be in a good state and never in an inconsistent state)
Isolated (transactions don’t interfere with one another)
Durable (a committed transaction is safe, even after a server crash)

We should note here that consistency in ACID is different from consistency in CAP (covered in Appendix 2, The CAP Theorem).

We can wrap any transaction within a BEGIN TRANSACTION block. To verify atomicity, we’ll kill the transaction with the ROLLBACK command.

	BEGIN TRANSACTION;
	DELETE FROM events;
	ROLLBACK;
	SELECT * FROM events;

The events all remain. Transactions are useful when you’re modifying two tables that you don’t want out of sync. The classic example is a debit/credit system for a bank, where money is moved from one account to another:

	BEGIN TRANSACTION;
	UPDATE account SET total=total+5000.0 WHERE account_id=1337;
	UPDATE account SET total=total-5000.0 WHERE account_id=45887;
	END;

If something happened between the two updates, this bank just lost five grand. But when wrapped in a transaction block, the initial update is rolled back, even if the server explodes.

Stored Procedures

Every command we’ve seen until now has been declarative in the sense that we‘ve been able to get our desired result set using just SQL (which is quite powerful in itself). But sometimes the database doesn‘t give us everything we need natively and we need to run some code to fill in the gaps. At that point, though, you need to decide where the code is going to run. Should it run in Postgres or should it run on the application side?

If you decide you want the database to do the heavy lifting, Postgres offers stored procedures. Stored procedures are extremely powerful and can be used to do an enormous range of tasks, from performing complex mathematical operations that aren’t supported in SQL to triggering cascading series of events to pre-validating data before it‘s written to tables and far beyond. On the one hand, stored procedures can offer huge performance advantages. But the architectural costs can be high (and sometimes not worth it). You may avoid streaming thousands of rows to a client application, but you have also bound your application code to this database. And so the decision to use stored procedures should not be made lightly.

Caveats aside, let’s create a procedure (or FUNCTION) that simplifies INSERTing a new event at a venue without needing the venue_id. Here‘s what the procedure will accomplish: if the venue doesn’t exist, it will be created first and then referenced in the new event. The procedure will also return a Boolean indicating whether a new venue was added as a helpful bonus.

When relational databases hit their heyday, they were the Swiss Army knife of technologies. You could store nearly anything—even programming entire projects in them (for example, Microsoft Access). The few companies that provided this software promoted use of proprietary differences and then took advantage of this corporate reliance by charging enormous license and consulting fees. This was the dreaded vendor lock-in that newer programming methodologies tried to mitigate in the 1990s and early 2000s.

The zeal to neuter the vendors, however, birthed maxims such as no logic in the database. This is a shame because relational databases are capable of so many varied data management options. Vendor lock-in has not disappeared. Many actions we investigate in this book are highly implementation-specific. However, it’s worth knowing how to use databases to their fullest extent before deciding to skip tools such as stored procedures solely because they’re implementation-specific.

postgres/add_event.sql

	CREATE OR REPLACE FUNCTION add_event(
	title text,
	starts timestamp,
	ends timestamp,
	venue text,
	postal varchar(9),
	country char(2))
	RETURNS boolean AS $$
	DECLARE
	did_insert boolean := false;
	found_count integer;
	the_venue_id integer;
	BEGIN
	SELECT venue_id INTO the_venue_id
	FROM venues v
	WHERE v.postal_code=postal AND v.country_code=country AND v.name ILIKE venue
	LIMIT 1;

	IF the_venue_id IS NULL THEN
	INSERT INTO venues (name, postal_code, country_code)
	VALUES (venue, postal, country)
	RETURNING venue_id INTO the_venue_id;

	did_insert := true;
	END IF;

	-- Note: this is a notice, not an error as in some programming languages
	RAISE NOTICE 'Venue found %', the_venue_id;

	INSERT INTO events (title, starts, ends, venue_id)
	VALUES (title, starts, ends, the_venue_id);

	RETURN did_insert;
	END;
	$$ LANGUAGE plpgsql;

You can import this external file into the current schema using the following command-line argument (if you don’t feel like typing all that code).

7dbs=# i add_event.sql

This stored procedure is run as a SELECT statement.

	SELECT add_event('House Party', '2018-05-03 23:00',
	'2018-05-04 02:00', 'Run''s House', '97206', 'us');

Running it should return t (true) because this is the first use of the venue Run’s House. This saves a client two round-trip SQL commands to the database (a SELECT and then an INSERT) and instead performs only one.

This is the first of a number of places where you’ll see this theme in this book: Does the code belong in your application or in the database? It’s often a difficult decision, one that you’ll have to answer uniquely for every application.

In many cases, you’ll improve performance by as much as an order of magnitude. For example, you might have a complex application-specific calculation that requires custom code. If the calculation involves many rows, a stored procedure will save you from moving thousands of rows instead of a single result. The cost is splitting your application, your code, and your tests across two different programming paradigms.

The language we used in the procedure we wrote is PL/pgSQL (which stands for Procedural Language/PostgreSQL). Covering the details of an entire programming language is beyond the scope of this book, but you can read much more about it in the online PostgreSQL documentation.^[6]

In addition to PL/pgSQL, Postgres supports three more core languages for writing procedures: Tcl (PL/Tcl), Perl (PL/Perl), and Python (PL/Python). People have written extensions for a dozen more, including Ruby, Java, PHP, Scheme, and others listed in the public documentation. Try this shell command:

$ createlang 7dbs --list

It will list the languages installed in your database. The createlang command is also used to add new languages, which you can find online.^[7]

Pull the Triggers

Triggers automatically fire stored procedures when some event happens, such as an insert or update. They allow the database to enforce some required behavior in response to changing data.

Let’s create a new PL/pgSQL function that logs whenever an event is updated (we want to be sure no one changes an event and tries to deny it later). First, create a logs table to store event changes. A primary key isn’t necessary here because it’s just a log.

	CREATE TABLE logs (
	event_id integer,
	old_title varchar(255),
	old_starts timestamp,
	old_ends timestamp,
	logged_at timestamp DEFAULT current_timestamp
	);

Next, we build a function to insert old data into the log. The OLD variable represents the row about to be changed (NEW represents an incoming row, which we’ll see in action soon enough). Output a notice to the console with the event_id before returning.

postgres/log_event.sql

	CREATE OR REPLACE FUNCTION log_event() RETURNS trigger AS $$
	DECLARE
	BEGIN
	INSERT INTO logs (event_id, old_title, old_starts, old_ends)
	VALUES (OLD.event_id, OLD.title, OLD.starts, OLD.ends);
	RAISE NOTICE 'Someone just changed event #%', OLD.event_id;
	RETURN NEW;
	END;
	$$ LANGUAGE plpgsql;

Finally, we create our trigger to log changes after any row is updated.

	CREATE TRIGGER log_events
	AFTER UPDATE ON events
	FOR EACH ROW EXECUTE PROCEDURE log_event();

So, it turns out our party at Run’s House has to end earlier than we hoped. Let’s change the event.

	UPDATE events
	SET ends='2018-05-04 01:00:00'
	WHERE title='House Party';

	NOTICE: Someone just changed event #9

And the old end time was logged.

	SELECT event_id, old_title, old_ends, logged_at
	FROM logs;

	event_id \| old_title \| old_ends \| logged_at
	---------+-------------+---------------------+------------------------
	9 \| House Party \| 2018-05-04 02:00:00 \| 2017-02-26 15:50:31.939

Triggers can also be created before updates and before or after inserts.^[8]

Viewing the World

Wouldn’t it be great if you could use the results of a complex query just like any other table? Well, that’s exactly what VIEWs are for. Unlike stored procedures, these aren’t functions being executed but rather aliased queries. Let’s say that we wanted to see only holidays that contain the word Day and have no venue. We could create a VIEW for that like this:

postgres/holiday_view_1.sql

	CREATE VIEW holidays AS
	SELECT event_id AS holiday_id, title AS name, starts AS date
	FROM events
	WHERE title LIKE '%Day%' AND venue_id IS NULL;

Creating a view really is as simple as writing a query and prefixing it with CREATE VIEW some_view_name AS. Now you can query holidays like any other table. Under the covers it’s the plain old events table. As proof, add Valentine’s Day on 2018-02-14 to events and query the holidays view.

	SELECT name, to_char(date, 'Month DD, YYYY') AS date
	FROM holidays
	WHERE date <= '2018-04-01';

	name \| date
	------------------+--------------------
	April Fools Day \| April 01, 2018
	Valentine's Day \| February 14, 2018

Views are powerful tools for opening up complex queried data in a simple way. The query may be a roiling sea of complexity underneath, but all you see is a table.

Though VIEWs like the holidays view mentioned previously are a convenient abstraction, they don’t yield any performance gains over the SELECT queries that they alias. If you want VIEWs that do offer such gains, you should consider creating materialized views, which are different because they’re stored on disk in a “real” table and thus yield performance gains because they restrict the number of tables that must be accessed to exactly one.

You can create materialized views just like ordinary views, except with a CREATE MATERIALIZED VIEW rather than CREATE VIEW statement. Materialized view tables are populated whenever you run the REFRESH command for them, which you can automate to run at defined intervals or in response to triggers. You can also create indexes on materialized views the same way that you can on regular tables.

The downside of materialized views is that they do increase disk space usage. But in many cases, the performance gains are worth it. In general, the more complex the query and the more tables it spans, the more performance gains you’re likely to get vis-à-vis plain old SELECT queries or VIEWs.

If you want to add a new column to the holidays view, it will have to come from the underlying table. Let’s alter the events table to have an array of associated colors.

	ALTER TABLE events
	ADD colors text ARRAY;

Because holidays are to have colors associated with them, let’s update the VIEW query to contain the colors array.

	CREATE OR REPLACE VIEW holidays AS
	SELECT event_id AS holiday_id, title AS name, starts AS date, colors
	FROM events
	WHERE title LIKE '%Day%' AND venue_id IS NULL;

Now it’s a matter of setting an array or color strings to the holiday of choice. Unfortunately, we cannot update a VIEW directly.

	UPDATE holidays SET colors = '{"red","green"}' where name = 'Christmas Day';

	ERROR: cannot update a view
	HINT: You need an unconditional ON UPDATE DO INSTEAD rule.

Looks like we need a RULE instead of a view.

What RULEs the School?

A RULE is a description of how to alter the parsed query tree. Every time Postgres runs an SQL statement, it parses the statement into a query tree (generally called an abstract syntax tree).

Operators and values become branches and leaves in the tree, and the tree is walked, pruned, and in other ways edited before execution. This tree is optionally rewritten by Postgres rules, before being sent on to the query planner (which also rewrites the tree to run optimally), and sends this final command to be executed. See how SQL gets executed in PostgreSQL in the figure.

In fact, a VIEW such as holidays is a RULE. We can prove this by taking a look at the execution plan of the holidays view using the EXPLAIN command (notice Filter is the WHERE clause, and Output is the column list).

	EXPLAIN VERBOSE
	SELECT *
	FROM holidays;

	QUERY PLAN
	-----------------------------------------------------------------------------
	Seq Scan on public.events (cost=0.00..1.01 rows=1 width=44)
	Output: events.event_id, events.title, events.starts, events.colors
	Filter: ((events.venue_id IS NULL) AND
	((events.title)::text ~~ '%Day%'::text))

Compare that to running EXPLAIN VERBOSE on the query from which we built the holidays VIEW. They’re functionally identical.

	EXPLAIN VERBOSE
	SELECT event_id AS holiday_id,
	title AS name, starts AS date, colors
	FROM events
	WHERE title LIKE '%Day%' AND venue_id IS NULL;

	QUERY PLAN
	-----------------------------------------------------------------------------
	Seq Scan on public.events (cost=0.00..1.04 rows=1 width=57)
	Output: event_id, title, starts, colors
	Filter: ((events.venue_id IS NULL) AND
	((events.title)::text ~~ '%Day%'::text))

So, to allow updates against our holidays view, we need to craft a RULE that tells Postgres what to do with an UPDATE. Our rule will capture updates to the holidays view and instead run the update on events, pulling values from the pseudorelations NEW and OLD. NEW functionally acts as the relation containing the values we’re setting, while OLD contains the values we query by.

postgres/create_rule.sql

	CREATE RULE update_holidays AS ON UPDATE TO holidays DO INSTEAD
	UPDATE events
	SET title = NEW.name,
	starts = NEW.date,
	colors = NEW.colors
	WHERE title = OLD.name;

With this rule in place, now we can update holidays directly.

UPDATE holidays SET colors = '{"red","green"}' where name = 'Christmas Day';

Next, let’s insert New Years Day on 2013-01-01 into holidays. As expected, we need a rule for that too. No problem.

	CREATE RULE insert_holidays AS ON INSERT TO holidays DO INSTEAD
	INSERT INTO ...

We’re going to move on from here, but if you’d like to play more with RULEs, try to add a DELETE RULE.

I’ll Meet You at the Crosstab

For our last exercise of the day, we’re going to build a monthly calendar of events, where each month in the calendar year counts the number of events in that month. This kind of operation is commonly done by a pivot table. These constructs “pivot” grouped data around some other output, in our case a list of months. We’ll build our pivot table using the crosstab function.

Start by crafting a query to count the number of events per month each year. PostgreSQL provides an extract function that returns some subfield from a date or timestamp, which aids in our grouping.

	SELECT extract(year from starts) as year,
	extract(month from starts) as month, count(*)
	FROM events
	GROUP BY year, month
	ORDER BY year, month;

To use crosstab, the query must return three columns: rowid, category, and value. We’ll be using the year as an ID, which means the other fields are category (the month) and value (the count).

The crosstab function needs another set of values to represent months. This is how the function knows how many columns we need. These are the values that become the columns (the table to pivot against). So let’s create a table to store a list of numbers. Because we‘ll only need the table for a few operations, we‘ll create an ephemeral table, which lasts only as long as the current Postgres session, using the CREATE TEMPORARY TABLE command.

	CREATE TEMPORARY TABLE month_count(month INT);
	INSERT INTO month_count VALUES (1),(2),(3),(4),(5),
	(6),(7),(8),(9),(10),(11),(12);

Now we’re ready to call crosstab with our two queries.

	SELECT * FROM crosstab(
	'SELECT extract(year from starts) as year,
	extract(month from starts) as month, count()*
	FROM events
	GROUP BY year, month
	ORDER BY year, month',
	'SELECT FROM month_count'*
	);

	ERROR: a column definition list is required for functions returning "record"

Oops. An error occurred. This cryptic error is basically saying that the function is returning a set of records (rows) but it doesn’t know how to label them. In fact, it doesn’t even know what datatypes they are.

Remember, the pivot table is using our months as categories, but those months are just integers. So, we define them like this:

	SELECT * FROM crosstab(
	'SELECT extract(year from starts) as year,
	extract(month from starts) as month, count()*
	FROM events
	GROUP BY year, month
	ORDER BY year, month',
	'SELECT FROM month_count'*
	) AS (
	year int,
	jan int, feb int, mar int, apr int, may int, jun int,
	jul int, aug int, sep int, oct int, nov int, dec int
	) ORDER BY YEAR;

We have one column year (which is the row ID) and twelve more columns representing the months.

	year \| jan \| feb \| mar \| apr \| may \| jun \| jul \| aug \| sep \| oct \| nov \| dec
	-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
	2018 \| \| 5 \| \| 1 \| 1 \| \| \| \| \| \| \| 1

Go ahead and add a couple more events on another year just to see next year’s event counts. Run the crosstab function again, and enjoy the calendar.

Day 2 Wrap-Up

Today finalized the basics of PostgreSQL. What we’re starting to see is that Postgres is more than just a server for storing vanilla datatypes and querying them. Instead, it’s a powerful data management engine that can reformat output data, store weird datatypes such as arrays, execute logic, and provide enough power to rewrite incoming queries.

Day 2 Homework

Find

Find the list of aggregate functions in the PostgreSQL docs.
Find a GUI program to interact with PostgreSQL, such as pgAdmin, Datagrip, or Navicat.

Create a rule that captures DELETEs on venues and instead sets the active flag (created in the Day 1 homework) to FALSE.
A temporary table was not the best way to implement our event calendar pivot table. The generate_series(a, b) function returns a set of records, from a to b. Replace the month_count table SELECT with this.
Build a pivot table that displays every day in a single month, where each week of the month is a row and each day name forms a column across the top (seven days, starting with Sunday and ending with Saturday) like a standard month calendar. Each day should contain a count of the number of events for that date or should remain blank if no event occurs.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

	title \| starts \| ends \| venue
	-----------------+---------------------+---------------------+---------------
	Wedding \| 2018-02-26 21:00:00 \| 2018-02-26 23:00:00 \| Voodoo Doughnut
	Dinner with Mom \| 2018-02-26 18:00:00 \| 2018-02-26 20:30:00 \| My Place
	Valentine's Day \| 2018-02-14 00:00:00 \| 2018-02-14 23:59:00 \|

	SELECT venue_id, count(*)
	FROM events
	GROUP BY venue_id;

	venue_id \| count
	----------+-------
	1 \| 1
	2 \| 2
	3 \| 1
	4 \| 3

	SELECT title, count() OVER (PARTITION* BY venue_id) FROM events;

	title \| count
	-------------+-------
	Moby \| 1
	Fight Club \| 1
	House Party \| 3
	House Party \| 3
	House Party \| 3
	(5 rows)

Table of Contents for Day 2: Advanced Queries, Code, and Rules

Create new playlist

Sign In

Sign Up

Day 2: Advanced Queries, Code, and Rules

Aggregate Functions

Grouping

Window Functions

Transactions

Stored Procedures

Pull the Triggers

Viewing the World

What RULEs the School?

I’ll Meet You at the Crosstab

Day 2 Wrap-Up

Day 2 Homework

Table of Contents for
Day 2: Advanced Queries, Code, and Rules