PostgreSQL surpasses other database products in ANSI SQL compliance. It cements its lead by adding constructs that range from convenient syntax shorthands to avant-garde features that break the bounds of traditional SQL. In this chapter, we’ll cover some SQL tidbits not often found in other databases. For this chapter, you should have a working knowledge of SQL; otherwise, you may not appreciate the labor-saving amuse-bouche that PostgreSQL brings to the table.
Well-designed relational databases store data in normalized form. To access this data across scattered tables, you write queries to join underlying tables. When you find yourself writing the same query over and over again, create a view. Simply put, a view is nothing more than a query permanently stored in the database.
Some purists have argued that one should always query a view, never tables. This means you must create a view for every table that you intend to query directly. The added layer of indirection eases management of permissions and facilitates abstraction of table data. We find this to be sound advice, but laziness gets the better of us.
Views in PostgreSQL have evolved over the years. Version 9.3 unveiled automatically updatable views. If your view draws from a single table and you include the primary key as an output column, you can issue an update command directly against your view. Data in the underlying table will follow suit.
Version 9.3 also introduced materialized views. When you mark a view as materialized, it will requery the data only when you issue the REFRESH command. The upside is that you're not wasting resources running complex queries repeatedly; the downside is that you might not have the most up-to-date data when you use the view. Furthermore, under some circumstances you are barred from access to the view during a refresh.
Version 9.4 allows users to access materialized views during refreshes. It also introduced the WITH CHECK OPTION modifier, which prevents inserts and updates outside the scope of the view.
The simplest view draws from a single table. Always include the primary key if you intend to write data back to the table, as shown in Example 7-1.
CREATE OR REPLACE VIEW census.vw_facts_2011 AS
SELECT fact_type_id, val, yr, tract_id
FROM census.facts
WHERE yr = 2011;
As of version 9.3, you can alter the data in this view by using INSERT, UPDATE, or DELETE commands. Updates and deletes will abide by any WHERE condition you have as part of your view. For example, the following query will delete only records whose val is 0:
DELETE FROM census.vw_facts_2011 WHERE val = 0;
And the following will not update any records, because the view explicitly includes only records for 2011:
UPDATE census.vw_facts_2011 SET val = 1 WHERE yr = 2012;
Be aware that you can insert data that falls outside the view's WHERE condition, or update data so that it is no longer visible from the view, as shown in Example 7-2.
UPDATE census.vw_facts_2011 SET yr = 2012 WHERE yr = 2011;
The update in Example 7-2 does not violate the WHERE condition. But, once executed, you will have emptied your view. For the sake of sanity, you may find it desirable to prevent updates or inserts that leave data invisible to further queries. Version 9.4 introduced the WITH CHECK OPTION modifier to accomplish this. Include this modifier when creating the view and PostgreSQL will forever balk at any attempt to add records outside the view or to update records in a way that puts them outside the view. In our example view, our goal is to limit vw_facts_2011 to allow inserts only of 2011 data and to disallow updates of yr to anything other than 2011. To add this restriction, we revise our view definition as shown in Example 7-3.
CREATE OR REPLACE VIEW census.vw_facts_2011 AS
SELECT fact_type_id, val, yr, tract_id
FROM census.facts
WHERE yr = 2011
WITH CHECK OPTION;
Now try to run an update such as:
UPDATE census.vw_facts_2011 SET yr = 2012 WHERE val > 2942;
You'll get an error:

ERROR: New row violates WITH CHECK OPTION for view "vw_facts_2011"
DETAIL: Failing row contains (1, 25001010500, 2012, 2985.000, 100.00).
Views can encapsulate joins among tables. When a view draws from more than one table, updating the underlying data with a simple command is no longer possible. Drawing data from more than one table introduces inherent ambiguity when you’re trying to update the underlying data, and PostgreSQL is not about to make an arbitrary decision for you. For instance, if you have a view that joins a table of countries with a table of provinces, and then decide to delete one of the rows, PostgreSQL won’t know whether you intend to delete only a country, a province, or a particular country-province pairing. Nonetheless, you can still modify the underlying data through the view using triggers.
Let’s start by creating a view that pulls rows from the facts table and a lookup table, as shown in Example 7-4.
CREATE OR REPLACE VIEW census.vw_facts AS
SELECT
    y.fact_type_id, y.category, y.fact_subcats, y.short_name,
    x.tract_id, x.yr, x.val, x.perc
FROM census.facts As x INNER JOIN census.lu_fact_types As y
    ON x.fact_type_id = y.fact_type_id;
To make this view updatable with a trigger, you can define one or more INSTEAD OF triggers. We first define the trigger function to handle the trifecta: INSERT, UPDATE, DELETE. In addition, PostgreSQL supports triggers on the TRUNCATE event. You can write the function in any language except SQL, and you're free to name it whatever you like. We chose PL/pgSQL in Example 7-5.
CREATE OR REPLACE FUNCTION census.trig_vw_facts_ins_upd_del() RETURNS trigger AS
$$
BEGIN
    IF (TG_OP = 'DELETE') THEN
        DELETE FROM census.facts AS f
        WHERE
            f.tract_id = OLD.tract_id AND
            f.yr = OLD.yr AND
            f.fact_type_id = OLD.fact_type_id;
        RETURN OLD;
    END IF;
    IF (TG_OP = 'INSERT') THEN
        INSERT INTO census.facts(tract_id, yr, fact_type_id, val, perc)
        SELECT NEW.tract_id, NEW.yr, NEW.fact_type_id, NEW.val, NEW.perc;
        RETURN NEW;
    END IF;
    IF (TG_OP = 'UPDATE') THEN
        IF ROW(OLD.fact_type_id, OLD.tract_id, OLD.yr, OLD.val, OLD.perc) !=
           ROW(NEW.fact_type_id, NEW.tract_id, NEW.yr, NEW.val, NEW.perc) THEN
            UPDATE census.facts AS f
            SET tract_id = NEW.tract_id,
                yr = NEW.yr,
                fact_type_id = NEW.fact_type_id,
                val = NEW.val,
                perc = NEW.perc
            WHERE
                f.tract_id = OLD.tract_id AND
                f.yr = OLD.yr AND
                f.fact_type_id = OLD.fact_type_id;
            RETURN NEW;
        ELSE
            RETURN NULL;
        END IF;
    END IF;
END;
$$
LANGUAGE plpgsql VOLATILE;
Handles deletes. Delete only records with keys matching those of the OLD record.
Handles inserts.
Handles updates. Use the OLD record to determine which records to update; the NEW record holds the new data.
Update rows only if at least one of the columns from the facts table has changed.
Next, we bind the trigger function to the view, as shown in Example 7-6.
CREATE TRIGGER trig_01_vw_facts_ins_upd_del
INSTEAD OF INSERT OR UPDATE OR DELETE ON census.vw_facts
FOR EACH ROW EXECUTE PROCEDURE census.trig_vw_facts_ins_upd_del();
The binding syntax is uncharacteristically English-like.
Now when we update, delete, or insert into our view, we update the underlying facts table instead:
UPDATE census.vw_facts
SET yr = 2012
WHERE yr = 2011 AND tract_id = '25027761200';
Upon a successful update, PostgreSQL returns the following message:
Query returned successfully: 56 rows affected, 40 ms execution time.
If we try to update a field not in our update row comparison, the update will not take place:
UPDATE census.vw_facts SET short_name = 'test';
With a message:
Query returned successfully: 0 rows affected, 931 ms execution time.
Although this example created a single trigger function to handle multiple events, we could have just as easily created a separate trigger and trigger function for each event.
PostgreSQL has another approach for updating views called rules, which predates the introduction of INSTEAD OF trigger support for views. You can see an example using rules in Database Abstraction with Updatable Views.
You can still use rules to update view data, but INSTEAD OF triggers are preferred now. Internally, PostgreSQL still uses rules to define a view (a view is nothing but an INSTEAD OF SELECT rule on a virtual table) and to implement single-table updatable views. The difference between using a trigger and a rule is that a rule rewrites the underlying query, whereas a trigger gets called for each virtual row. As such, rules become overwhelmingly difficult to write (and understand) when many tables are involved. Rules are also limited in that they can be written only in SQL, not in other procedural languages.
Materialized views cache the fetched data. This happens when you first create the view as well as when you run the REFRESH MATERIALIZED VIEW command. To use materialized views, you need at least version 9.3.
The most convincing cases for using materialized views are when the underlying query takes a long time and when having timely data is not critical. You often encounter these scenarios when building online analytical processing (OLAP) applications.
Unlike nonmaterialized views, you can add indexes to materialized views to speed up the read.
Example 7-7 demonstrates how to make a materialized version of the view in Example 7-1.
CREATE MATERIALIZED VIEW census.vw_facts_2011_materialized AS
SELECT fact_type_id, val, yr, tract_id
FROM census.facts
WHERE yr = 2011;
Create an index on a materialized view as you would do on a regular table, as shown in Example 7-8.
CREATE UNIQUE INDEX ix
ON census.vw_facts_2011_materialized (tract_id, fact_type_id, yr);
For speedier access to a materialized view with a large number of records, you may want to control the physical sort of the data. The easiest way is to include an ORDER BY when you create the view. Alternatively, you can add a cluster index to the view. First, create an index in the physical sort order you want. Then run the CLUSTER command, passing it the index, as shown in Example 7-9.
CLUSTER census.vw_facts_2011_materialized USING ix;
CLUSTER census.vw_facts_2011_materialized;
Name the index to cluster on. Needed only the first time you cluster.
Each time you refresh, you must recluster the data.
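Putting the two together, a typical maintenance cycle for a clustered materialized view is a refresh followed by a bare recluster:

```sql
-- Rebuild the cached data, then restore the physical sort order
-- (the index recorded by the earlier CLUSTER ... USING ix is reused)
REFRESH MATERIALIZED VIEW census.vw_facts_2011_materialized;
CLUSTER census.vw_facts_2011_materialized;
```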
The advantage of using ORDER BY in the materialized view over the CLUSTER approach is that the sort is maintained with each REFRESH MATERIALIZED VIEW call, alleviating the need to recluster. The downside is that ORDER BY generally adds more processing time to the REFRESH step of the view. You should test the effect of ORDER BY on the performance of REFRESH before using it. One way to test is to run the underlying query of the view with an ORDER BY clause.
To refresh the view in PostgreSQL 9.3, use:
REFRESH MATERIALIZED VIEW census.vw_facts_2011_materialized;
The view cannot be queried while the REFRESH MATERIALIZED VIEW step is running.
In PostgreSQL 9.4, to allow the view to be queried while it’s refreshing, you can use:
REFRESH MATERIALIZED VIEW CONCURRENTLY census.vw_facts_2011_materialized;
Current limitations of materialized views include:
You can't use CREATE OR REPLACE to edit an existing materialized view. You must drop and re-create the view even for the most trivial of changes, using DROP MATERIALIZED VIEW name_of_view. Annoyingly, you'll lose all your indexes.
You need to run REFRESH MATERIALIZED VIEW to rebuild the cache. PostgreSQL doesn't perform automatic recaching of any kind. You need to resort to mechanisms such as crontab, pgAgent jobs, or triggers to automate any kind of refresh. We have an example using triggers in Caching Data with Materialized Views and Statement-Level Triggers.
Refreshing materialized views in version 9.3 is a blocking operation, meaning that the view will not be accessible during the refresh process. In version 9.4 you can lift this quarantine by adding the CONCURRENTLY keyword to your REFRESH command, provided that you have established a unique index on your view. The trade-off is that concurrent refreshes could take longer to complete.
In our many years of writing SQL, we have come to appreciate the little things that make better use of our typing. Only PostgreSQL offers some of the gems we present in this section. Often this means that the construction is not ANSI-compliant. If thy God demands strict observance to the ANSI SQL standards, abstain from the short-cuts that we’ll be showing.
One of our favorites is DISTINCT ON. It behaves like DISTINCT, but with two enhancements: you can specify which columns to consider as distinct, and you can sort the remaining columns. One little word—ON—replaces numerous lines of additional code needed to achieve the same result.
In Example 7-10, we demonstrate how to get the details of the first tract for each county.
SELECT DISTINCT ON (left(tract_id, 5))
    left(tract_id, 5) As county, tract_id, tract_name
FROM census.lu_tracts
ORDER BY county, tract_id;
county | tract_id    | tract_name
-------+-------------+---------------------------------------------------
25001  | 25001010100 | Census Tract 101, Barnstable County, Massachusetts
25003  | 25003900100 | Census Tract 9001, Berkshire County, Massachusetts
25005  | 25005600100 | Census Tract 6001, Bristol County, Massachusetts
25007  | 25007200100 | Census Tract 2001, Dukes County, Massachusetts
25009  | 25009201100 | Census Tract 2011, Essex County, Massachusetts
:
(14 rows)
The ON modifier accepts multiple columns, considering all of them to determine distinctness. The ORDER BY clause has to start with the set of columns in the DISTINCT ON; then you can follow with your preferred ordering.
LIMIT returns only the number of rows indicated; OFFSET indicates the number of rows to skip. You can use them in tandem or separately. You almost always use them in conjunction with an ORDER BY. In Example 7-11, we demonstrate the use of a positive offset. Leaving out the offset yields the same result as setting the offset to zero.
Limits and offsets are not unique to PostgreSQL and are in fact copied from MySQL, although implementation differs widely among database products.
SELECT DISTINCT ON (left(tract_id, 5))
    left(tract_id, 5) As county, tract_id, tract_name
FROM census.lu_tracts
ORDER BY county, tract_id
LIMIT 3 OFFSET 2;
county | tract_id    | tract_name
-------+-------------+-------------------------------------------------
25005  | 25005600100 | Census Tract 6001, Bristol County, Massachusetts
25007  | 25007200100 | Census Tract 2001, Dukes County, Massachusetts
25009  | 25009201100 | Census Tract 2011, Essex County, Massachusetts
(3 rows)
ANSI SQL defines a construct called CAST that allows you to morph one data type to another. For example, CAST('2011-1-1' AS date) casts the text '2011-1-1' to a date. PostgreSQL has a shorthand for doing this, using a pair of colons, as in '2011-1-1'::date. This syntax is shorter and easier to apply in cases where you can't directly cast from one type to another and have to intercede with one or more intermediary types, such as someXML::text::integer.
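Side by side, the two spellings, plus a chained double cast, look like this:

```sql
SELECT CAST('2011-1-1' AS date);  -- ANSI-standard syntax
SELECT '2011-1-1'::date;          -- PostgreSQL double-colon shorthand
-- Chaining through an intermediary type: the text '2' becomes an integer
SELECT '2'::text::integer;
```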
PostgreSQL supports the multirow constructor to insert more than one record at a time. Example 7-12 demonstrates how to use a multirow construction to insert data into the table we created in Example 6-3.
INSERT INTO logs_2011 (user_name, description, log_ts)
VALUES
    ('robe', 'logged in', '2011-01-10 10:15 AM EST'),
    ('lhsu', 'logged out', '2011-01-11 10:20 AM EST');
The latter portion of the multirow constructor, starting with the VALUES keyword, is often referred to as a values list. A values list can stand alone, effectively creating a table on the fly, as in Example 7-13.
SELECT *
FROM (
    VALUES
        ('robe', 'logged in', '2011-01-10 10:15 AM EST'::timestamptz),
        ('lhsu', 'logged out', '2011-01-11 10:20 AM EST'::timestamptz)
) AS l (user_name, description, log_ts);
When you use VALUES as a stand-in for a virtual table, you need to specify names for the columns. You also need to explicitly cast the values to the data types in the table if the parser can't infer the data types from the data. The multirow VALUES construct also exists in MySQL and SQL Server.
PostgreSQL is case-sensitive. However, it does have mechanisms in place to ignore casing. You can apply the upper function to both sides of the ANSI LIKE operator, or you can simply use the ILIKE (~~*) operator:
SELECT tract_name FROM census.lu_tracts WHERE tract_name ILIKE '%duke%';
tract_name
------------------------------------------------
Census Tract 2001, Dukes County, Massachusetts
Census Tract 2002, Dukes County, Massachusetts
Census Tract 2003, Dukes County, Massachusetts
Census Tract 2004, Dukes County, Massachusetts
Census Tract 9900, Dukes County, Massachusetts
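For comparison, the ANSI-portable approach mentioned above applies a case-folding function to both sides of LIKE and returns the same rows:

```sql
SELECT tract_name
FROM census.lu_tracts
WHERE upper(tract_name) LIKE upper('%duke%');
```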
PostgreSQL has a construct called ANY that can be used in conjunction with arrays, combined with a comparator operator or comparator keyword. If any element of the array matches a row, that row is returned.
Here is an example:
SELECT tract_name
FROM census.lu_tracts
WHERE tract_name ILIKE ANY(ARRAY['%99%duke%', '%06%Barnstable%']::text[]);
tract_name
-----------------------------------------------------
Census Tract 102.06, Barnstable County, Massachusetts
Census Tract 103.06, Barnstable County, Massachusetts
Census Tract 106, Barnstable County, Massachusetts
Census Tract 9900, Dukes County, Massachusetts
(4 rows)
The example just shown is a shorthand way of combining multiple ILIKE conditions with OR. You can use ANY with other comparators such as LIKE, =, and ~ (the regex match operator). ANY can be used with any data types and comparison operators (operators that return a Boolean), including ones you built yourself or installed via extensions.
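As a quick sketch of ANY paired with the equality operator (the tract_ids here are taken from the earlier DISTINCT ON output):

```sql
-- Returns the rows whose tract_id equals any element of the array
SELECT tract_name
FROM census.lu_tracts
WHERE tract_id = ANY(ARRAY['25001010100', '25007200100']);
```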
A set-returning function is a function that can return more than one row. PostgreSQL allows set-returning functions to appear in the SELECT clause of an SQL statement. This is not true of most other databases, in which only scalar functions can appear in the SELECT.
Interweaving some set-returning functions into an already complicated query could produce results beyond what you expect, because these functions usually result in the creation of new rows. You must anticipate this if you'll be using the results as a subquery. In Example 7-14, we demonstrate row creation resulting from using a temporal version of generate_series. The example uses a table that we construct with:
CREATE TABLE interval_periods (i_type interval);
INSERT INTO interval_periods (i_type)
VALUES ('5 months'), ('132 days'), ('4862 hours');
SELECT i_type,
    generate_series('2012-01-01'::date, '2012-12-31'::date, i_type) As dt
FROM interval_periods;
i_type     | dt
-----------+-----------------------
5 months   | 2012-01-01 00:00:00-05
5 months   | 2012-06-01 00:00:00-04
5 months   | 2012-11-01 00:00:00-04
132 days   | 2012-01-01 00:00:00-05
132 days   | 2012-05-12 00:00:00-04
132 days   | 2012-09-21 00:00:00-04
4862 hours | 2012-01-01 00:00:00-05
4862 hours | 2012-07-21 15:00:00-04
When you query a table that has child tables, the query automatically drills down into the children, creating a union of all the child records satisfying the query condition. DELETE and UPDATE work the same way, drilling down the hierarchy for victims. Sometimes this is not desirable because you want data to come only from the table you specified, without the kids tagging along.
This is where the ONLY keyword comes in handy. We show an example of its use in Example 7-37, where we want to delete only those records from the production table that haven't been migrated to the log table. Without the ONLY modifier, we'd end up deleting records from the child table that might have already been moved previously.
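A minimal sketch of the syntax, assuming a hypothetical parent table named logs with inherited child tables, would be:

```sql
-- Deletes only rows physically stored in logs itself,
-- leaving rows in any inheriting child tables untouched
DELETE FROM ONLY logs WHERE log_ts < '2011-01-01';
```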
Often, when you delete data from a table, you'll want to delete it based on its presence in another set of data. Specify this additional set with the USING predicate. Then, in the WHERE clause, you can reference both the datasets in the USING and the table in the FROM to define conditions for deletion. Multiple tables can be included in USING, separated by commas. Example 7-15 deletes all records from census.facts that correspond to a fact type of short_name = 's01'.
DELETE FROM census.facts
USING census.lu_fact_types As ft
WHERE facts.fact_type_id = ft.fact_type_id AND ft.short_name = 's01';
The standards-compliant way would be to use a clunkier IN expression in the WHERE clause.
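For reference, an equivalent of Example 7-15 written with the standards-compliant IN expression would look something like:

```sql
-- Same effect as the DELETE ... USING form, expressed with a subquery
DELETE FROM census.facts
WHERE fact_type_id IN (
    SELECT fact_type_id
    FROM census.lu_fact_types
    WHERE short_name = 's01'
);
```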
The RETURNING predicate is supported by the ANSI SQL standard but not commonly found in other relational databases. We show an example in Example 7-37, where we return the records deleted. RETURNING can also be used for inserts and updates. For inserts into tables with a serial key, RETURNING is invaluable because it returns the key values of the new rows—something you wouldn't know prior to query execution. Although RETURNING is often accompanied by * for all fields, you can limit the fields, as we do in Example 7-16.
UPDATE census.lu_fact_types AS f
SET short_name = replace(replace(lower(f.fact_subcats[4]), ' ', '_'), ':', '')
WHERE f.fact_subcats[3] = 'Hispanic or Latino:' AND f.fact_subcats[4] > ''
RETURNING fact_type_id, short_name;
fact_type_id | short_name
-------------+-------------------------------------------------
          96 | white_alone
          97 | black_or_african_american_alone
          98 | american_indian_and_alaska_native_alone
          99 | asian_alone
         100 | native_hawaiian_and_other_pacific_islander_alone
         101 | some_other_race_alone
         102 | two_or_more_races
New in version 9.5 is the INSERT ON CONFLICT construct, often referred to as UPSERT. This feature is useful when you don't know whether a record already exists in a table and, rather than having the insert fail, you want it to either update the existing record or do nothing.
ON CONFLICT requires a primary key, unique key, unique index, or exclusion constraint to be in place; when an insert would violate it, you can ask for different behavior, such as updating the existing record or doing nothing at all. To demonstrate, imagine we have a table of colors:
CREATE TABLE colors(color varchar(50) PRIMARY KEY, hex varchar(6));
INSERT INTO colors(color, hex) VALUES('blue', '0000FF'), ('red', 'FF0000');
We then get a new batch of colors to add to our table, but some may be present already. If we did a regular insert, we'd get a primary key violation when we tried to add colors already in the table. When we run Example 7-17, only one record is inserted: the green that is not already in our table. Each subsequent run results in no records being inserted.
INSERT INTO colors(color, hex)
VALUES('blue', '0000FF'), ('red', 'FF0000'), ('green', '00FF00')
ON CONFLICT DO NOTHING;
Someone could come along and put a differently cased 'Blue' in our system, and we'd then have two differently cased blues. To remedy this, we can put a unique index on our table:
CREATE UNIQUE INDEX uidx_colors_lcolor ON colors USING btree(lower(color));
As before, if we tried to insert a 'Blue', we'd be prevented from doing so, and ON CONFLICT DO NOTHING would result in nothing happening. If we really wanted to spell the colors as given to us, we could use code like that in Example 7-18.
INSERT INTO colors(color, hex)
VALUES('Blue', '0000FF'), ('Red', 'FF0000'), ('Green', '00FF00')
ON CONFLICT(lower(color))
DO UPDATE SET color = EXCLUDED.color, hex = EXCLUDED.hex;
In Example 7-18 we specified the conflict target, which must match the expression of a constraint or unique index; using something like upper(color) would not work, because the colors table has no matching index for that expression.
In the case of INSERT ON CONFLICT DO UPDATE, you need to specify the conflicting condition or constraint name. If using a constraint, you'd use ON CONFLICT ON CONSTRAINT constraint_name_here, as shown in Example 7-19.
INSERT INTO colors(color, hex)
VALUES('Blue', '0000FF'), ('Red', 'FF0000'), ('Green', '00FF00')
ON CONFLICT ON CONSTRAINT colors_pkey
DO UPDATE SET color = EXCLUDED.color, hex = EXCLUDED.hex;
The DO part of the INSERT construct happens only when a primary key, unique index, or unique key constraint error is triggered. Other errors, such as data type violations or check constraint failures, still make the insert fail and are never handled by DO UPDATE or DO NOTHING.
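For instance, if we added a hypothetical check constraint on the hex column (not part of the original colors example), a violating insert would still error out even with ON CONFLICT DO NOTHING:

```sql
ALTER TABLE colors ADD CONSTRAINT chk_hex_len CHECK (length(hex) = 6);
-- Fails with a check constraint violation; ON CONFLICT cannot suppress it
INSERT INTO colors(color, hex) VALUES('teal', '0FF')
ON CONFLICT DO NOTHING;
```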
PostgreSQL automatically creates a data type for every table. Because data types derived from tables contain other data types, they are often called composite data types, or just composites. The first time you see a query with composites, you might be surprised. In fact, you might come across their versatility by accident when making a typo in an SQL statement. Try the following query:
SELECT x FROM census.lu_fact_types As x LIMIT 2;
At first glance, you might think that we left out a .* by accident, but check out the result:
x
------------------------------------------------------------------
(86,Population,"{D001,Total:}",d001)
(87,Population,"{D002,Total:,""Not Hispanic or Latino:""}",d002)
Instead of erroring out, the preceding example returns the canonical representation of the lu_fact_types composite data type. Composites can serve as input to several useful functions, among them array_agg and hstore (a function packaged with the hstore extension that converts a row into a key-value pair object).
If you are building web applications, you can take advantage of the built-in JSON and JSONB support we covered in "JSON" and use a combination of array_agg and array_to_json to output a query as a single JSON object, as shown in Example 7-20. In PostgreSQL 9.3 and later, you can use json_agg. See Example 7-21.
SELECT array_to_json(array_agg(f)) As cats
FROM (
    SELECT MAX(fact_type_id) As max_type, category
    FROM census.lu_fact_types
    GROUP BY category
) As f;
This will give you an output of:
cats
----------------------------------------------------
[{"max_type":102,"category":"Population"},
 {"max_type":153,"category":"Housing"}]
Defines a subquery named f. f can then be used to reference each row in the subquery.
Aggregate each row of the subquery using array_agg, then convert the array to JSON with array_to_json.
In version 9.3, the json_agg function replaces the chain of array_to_json and array_agg, offering both convenience and speed. In Example 7-21, we repeat Example 7-20 using json_agg; both examples produce the same output.
SELECT json_agg(f) As cats
FROM (
    SELECT MAX(fact_type_id) As max_type, category
    FROM census.lu_fact_types
    GROUP BY category
) As f;
In standard ANSI SQL, single quotes (') surround string literals. Should you have a single quote in the string itself, such as in last names like O'Nan, possessives like mom's place, or contractions like can't, you need to escape it. The escape character is another single quote placed in front of the single quote you're trying to escape. Say you're writing an insert statement and you copied a large passage from a novel. Affixing yet another single quote to every existing single quote is both tedious to do and challenging to read. After all, two single quotes look awfully like one double quote, which is another character entirely.
PostgreSQL lets you skip escaping single quotes in strings of any length by surrounding the string with a pair of double dollar signs ($$), hence the name dollar quoting.
Dollar quoting is also useful in situations where you're trying to execute a piece of SQL dynamically, such as exec(some sql). In Example 7-5, we enclosed the body of a trigger function using dollar quoting.
If you are writing an SQL statement that glues two sentences with many single quotes, the ANSI standard way would be to escape as in the following:
SELECT 'It''s O''Neil''s play. ' || 'It''ll start at two o''clock.'
With dollar quoting:
SELECT $$It's O'Neil's play. $$ || $$It'll start at two o'clock.$$
The pair of dollar signs replaces the single quote and escapes all single quotes within.
A variant of dollar quoting is named dollar quoting. We cover this in the following section.
The DO command allows you to inject a piece of procedural code into your SQL on the fly. You can think of it as a one-time anonymous function. As an example, we'll load the data collected in Example 3-10 into production tables from our staging table. We'll use PL/pgSQL for our procedural snippet, but you're free to use other languages.
First, we’ll create the table:
set search_path=census;
DROP TABLE IF EXISTS lu_fact_types CASCADE;
CREATE TABLE lu_fact_types (
    fact_type_id serial,
    category varchar(100),
    fact_subcats varchar(255)[],
    short_name varchar(50),
    CONSTRAINT pk_lu_fact_types PRIMARY KEY (fact_type_id)
);
Then we'll use DO to populate it, as shown in Example 7-22. CASCADE will force the drop of any related objects such as foreign key constraints and views, so be cautious when using it.
Example 7-22 generates a series of INSERT INTO ... SELECT statements. The SQL also performs an unpivot operation to convert columnar data into rows. Example 7-22 is only a partial listing of the code needed to build lu_fact_types. For the full code, refer to the building_census_tables.sql file that is part of the book code and data download.
DO language plpgsql
$$
DECLARE var_sql text;
BEGIN
    var_sql := string_agg(
        $sql$
        INSERT INTO lu_fact_types(category, fact_subcats, short_name)
        SELECT 'Housing', array_agg(s$sql$ || lpad(i::text,2,'0') ||
        ') As fact_subcats,' || quote_literal('s' || lpad(i::text,2,'0')) ||
        ' As short_name
        FROM staging.factfinder_import
        WHERE s' || lpad(i::text,2,'0') || $sql$ ~ '^[a-zA-Z]+' $sql$, ';'
    ) FROM generate_series(1,51) As i;
    EXECUTE var_sql;
END
$$;
Use of dollar quoting, so we don't need to escape the ' in 'Housing'. Since the DO command body is also wrapped in dollar quotes, we need to use a named $ delimiter inside. We chose $sql$.
Use string_agg to form a set of SQL statements as a single string of the form INSERT INTO lu_fact_types(...) SELECT ... WHERE s01 ~ '[a-zA-Z]+';
Execute the SQL.
In Example 7-22, we are using the dollar-quoting syntax covered in "Dollar Quoting" for the body of the DO function and for some fragments of the SQL statements inside the function. Since we use dollar quoting both to define the whole body of the DO and internally, we need to use named dollar quoting for at least one of them. The same nested dollar-quoting approach can be used for function definitions as well.
New in version 9.4 is the FILTER clause for aggregates, recently standardized in ANSI SQL. This replaces the standard CASE WHEN clause for reducing the number of rows included in an aggregation. For example, suppose you used CASE WHEN to break out average test scores by student, as shown in Example 7-23.
SELECT student,
    AVG(CASE WHEN subject = 'algebra' THEN score ELSE NULL END) As algebra,
    AVG(CASE WHEN subject = 'physics' THEN score ELSE NULL END) As physics
FROM test_scores
GROUP BY student;
The FILTER clause equivalent for Example 7-23 is shown in Example 7-24.
SELECT student,
    AVG(score) FILTER (WHERE subject = 'algebra') As algebra,
    AVG(score) FILTER (WHERE subject = 'physics') As physics
FROM test_scores
GROUP BY student;
In the case of averages, sums, and many other aggregates, CASE and FILTER are equivalent. The benefit is that FILTER is a little clearer in purpose and, for large datasets, faster. However, there are some aggregates, such as array_agg, which considers NULL fields, where the CASE statement gives you extra NULL values you don't want. In Example 7-25 we try to get the list of scores for each subject of interest for each student using the CASE .. WHEN .. approach.
SELECT student,
    array_agg(CASE WHEN subject = 'algebra' THEN score ELSE NULL END) As algebra,
    array_agg(CASE WHEN subject = 'physics' THEN score ELSE NULL END) As physics
FROM test_scores
GROUP BY student;
student | algebra                   | physics
--------+---------------------------+-------------------------------
jojo    | {74,NULL,NULL,NULL,74,..} | {NULL,83,NULL,NULL,NULL,79,..}
jdoe    | {75,NULL,NULL,NULL,78,..} | {NULL,72,NULL,NULL,NULL,72,..}
robe    | {68,NULL,NULL,NULL,77,..} | {NULL,83,NULL,NULL,NULL,85,..}
lhsu    | {84,NULL,NULL,NULL,80,..} | {NULL,72,NULL,NULL,NULL,72,..}
(4 rows)
Observe that in Example 7-25 we get a bunch of NULL
fields in our arrays. We could work
around this issue with some clever use of subselects, but most of those
will be more verbose and slower than the FILTER
alternative shown in Example 7-26.
SELECT student,
    array_agg(score) FILTER (WHERE subject = 'algebra') As algebra,
    array_agg(score) FILTER (WHERE subject = 'physics') As physics
FROM test_scores
GROUP BY student;
student | algebra | physics
--------+---------+--------
jojo    | {74,74} | {83,79}
jdoe    | {75,78} | {72,72}
robe    | {68,77} | {83,85}
lhsu    | {84,80} | {72,72}
FILTER works with all aggregate functions, not just the aggregate functions built into PostgreSQL.
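Since FILTER simply restricts which rows an aggregate sees, its semantics are easy to model outside the database. A minimal pure-Python sketch, using invented sample rows rather than the book's test_scores table:

```python
# Model of AVG(score) FILTER (WHERE ...): aggregate only rows that
# satisfy the predicate. The rows below are invented sample data.
rows = [
    {"subject": "algebra", "score": 74},
    {"subject": "physics", "score": 83},
    {"subject": "algebra", "score": 74},
    {"subject": "physics", "score": 79},
]

def avg_filter(rows, predicate):
    """Average of score over rows passing the FILTER predicate."""
    vals = [r["score"] for r in rows if predicate(r)]
    return sum(vals) / len(vals) if vals else None  # empty set -> NULL

algebra = avg_filter(rows, lambda r: r["subject"] == "algebra")
physics = avg_filter(rows, lambda r: r["subject"] == "physics")
print(algebra, physics)  # 74.0 81.0
```

As in SQL, an aggregate over an empty filtered set yields NULL (here, None) rather than an error.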
New in PostgreSQL 9.4 are statistical functions for computing percentile, median (aka .5 percentile), and mode. These functions are percentile_disc (percentile discrete), percentile_cont (percentile continuous), and mode.
The two percentile functions differ in how they handle even counts. For the discrete function, the first value encountered is taken, so the ordering of the data matters. For the continuous case, values within the same percentile are averaged.
Median is merely the .5 percentile; therefore, it does not deserve a separate function of its own. The mode function finds the most common value. Should there be more than one mode, the first one encountered is returned; therefore, ordering matters, as shown in Example 7-27.
SELECT student,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY score) As cont_median,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY score) AS disc_median,
    mode() WITHIN GROUP (ORDER BY score) AS mode,
    COUNT(*) As num_scores
FROM test_scores
GROUP BY student
ORDER BY student;
student | cont_median | disc_median | mode | num_scores
--------+-------------+-------------+------+-----------
alex    | 78          | 77          | 74   | 8
leo     | 72          | 72          | 72   | 8
regina  | 76          | 76          | 68   | 9
sonia   | 73.5        | 72          | 72   | 8
(4 rows)
Example 7-27 computes both the discrete and the continuous median score, which could differ when students have an even number of scores.
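The underlying definitions are straightforward; here is a pure-Python sketch of the three calculations. The score list is invented sample data, and the functions are simplified stand-ins for illustration, not the PostgreSQL implementations:

```python
import math
from collections import Counter

def percentile_disc(values, p):
    """First value in sorted order whose cumulative share reaches p."""
    s = sorted(values)
    return s[max(math.ceil(p * len(s)) - 1, 0)]

def percentile_cont(values, p):
    """Linear interpolation between the two nearest sorted values."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = math.floor(pos)
    frac = pos - lo
    return s[lo] if frac == 0 else s[lo] + frac * (s[lo + 1] - s[lo])

def mode(values):
    """Most common value; ties go to the first in sort order."""
    counts = Counter(sorted(values))
    return max(counts, key=counts.get)

scores = [72, 72, 74, 75, 78, 80, 83, 86]  # invented sample: 8 scores
print(percentile_cont(scores, 0.5))  # 76.5 -- average of the middle pair
print(percentile_disc(scores, 0.5))  # 75   -- an actual value from the list
print(mode(scores))                  # 72
```

With an even count, the continuous median interpolates between the two middle values while the discrete median returns one of them, which is exactly why the two columns differ for some students above.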
The inputs of these functions differ from those of other aggregate functions: the column being aggregated appears in the ORDER BY clause of the WITHIN GROUP modifier rather than as a direct argument to the function.
The percentile functions have another variant that accepts an array of percentiles, letting you retrieve multiple percentiles in one call. Example 7-28 computes the median, the 60th percentile, and the highest score.
SELECT student,
    percentile_cont('{0.5,0.60,1}'::float[]) WITHIN GROUP (ORDER BY score) AS cont_median,
    percentile_disc('{0.5,0.60,1}'::float[]) WITHIN GROUP (ORDER BY score) AS disc_median,
    COUNT(*) As num_scores
FROM test_scores
GROUP BY student
ORDER BY student;
student | cont_median    | disc_median | num_scores
--------+----------------+-------------+-----------
alex    | {78,79.2,84}   | {77,79,84}  | 8
leo     | {72,73.6,84}   | {72,72,84}  | 8
regina  | {76,76.8,90}   | {76,77,90}  | 9
sonia   | {73.5,75.6,86} | {72,75,86}  | 8
(4 rows)
As with all aggregates, you can combine these functions with
modifiers. Example 7-29
combines WITHIN GROUP
with FILTER
.
SELECT student,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY score)
        FILTER (WHERE subject = 'algebra') AS algebra,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY score)
        FILTER (WHERE subject = 'physics') AS physics
FROM test_scores
GROUP BY student
ORDER BY student;
student | algebra | physics
--------+---------+--------
alex    | 74      | 79
leo     | 80      | 72
regina  | 68      | 83
sonia   | 75      | 72
(4 rows)
Window functions are a common ANSI SQL feature. A window function has the
prescience to see and use data beyond the current row; hence the term
window. A window defines which other
rows need to be considered in addition to the current row. Windows let you
add aggregate information to each row of your output where the aggregation
involves other rows in the same window. Window functions such as row_number
and rank
are
useful for ordering your data in sophisticated ways that use rows outside
the selected results but within a window.
Without window functions, you’d have to resort to using joins and subqueries to poll neighboring rows. On the surface, window functions violate the set-based principle of SQL, but we mollify the purist by claiming that they are merely shorthand. You can find more details and examples in Window Functions.
Example 7-30 gives you a quick start.
Using a window function, we can obtain both the detail data and the
average value for all records with fact_type_id
of 86 in one
single SELECT
. Note that the WHERE
clause is always evaluated
before the window function.
SELECT tract_id, val, AVG(val) OVER () as val_avg
FROM census.facts
WHERE fact_type_id = 86;
tract_id    | val      | val_avg
------------+----------+----------------------
25001010100 | 2942.000 | 4430.0602165087956698
25001010206 | 2750.000 | 4430.0602165087956698
25001010208 | 2003.000 | 4430.0602165087956698
25001010304 | 2421.000 | 4430.0602165087956698
:
The OVER clause sets the boundary of the window. In this example, because the parentheses contain no constraint, the window covers all the rows returned by our WHERE clause, so the average is calculated across all rows with fact_type_id = 86.
The clause also morphed our conventional AVG
aggregate
function into a window aggregate function. For each row, PostgreSQL
submits all the rows in the window to the AVG
aggregation and
outputs the value as part of the row. Because our window has multiple
rows, the result of the aggregation is repeated. Notice that with window
functions, we were able to perform an aggregation without GROUP
BY
. Furthermore, we were able to rejoin the aggregated result back
with the other variables without using a formal join.
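Conceptually, for each row PostgreSQL hands the whole window to the aggregate and attaches the result to that row. A Python sketch of that behavior, using a small invented subset of the facts data:

```python
# Window aggregate: each row keeps its detail columns and gains the
# aggregate computed over every row in its window (here, the whole set).
rows = [("25001010100", 2942.0), ("25001010206", 2750.0),
        ("25001010208", 2003.0), ("25001010304", 2421.0)]

val_avg = sum(val for _, val in rows) / len(rows)  # AVG(val) OVER ()
result = [(tract_id, val, val_avg) for tract_id, val in rows]
print(result[0])  # ('25001010100', 2942.0, 2529.0)
```

Because the window is the same for every row, the aggregate value repeats on each output row, just as in the query output above; no GROUP BY and no join were needed to attach it.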
You can use all SQL aggregate functions as window functions. In addition, you’ll find ROW_NUMBER, RANK, LEAD, and others listed in Window Functions.
You can run a window function over rows containing particular values instead of using
the whole table. This requires the addition of a PARTITION
BY
clause, which instructs PostgreSQL to take the aggregate over
the indicated rows. In Example 7-31, we repeat what we did in
Example 7-30 but partition our window by
county code, which is always the first five characters of the
tract_id
column. Thus, the rows in each county code are
averaged separately.
SELECT tract_id, val,
    AVG(val) OVER (PARTITION BY left(tract_id,5)) As val_avg_county
FROM census.facts
WHERE fact_type_id = 2
ORDER BY tract_id;
tract_id    | val      | val_avg_county
------------+----------+----------------------
25001010100 | 1765.000 | 1709.9107142857142857
25001010206 | 1366.000 | 1709.9107142857142857
25001010208 | 984.000  | 1709.9107142857142857
:
25003900100 | 1920.000 | 1438.2307692307692308
25003900200 | 1968.000 | 1438.2307692307692308
25003900300 | 1211.000 | 1438.2307692307692308
:
Window functions also allow an ORDER BY
in the OVER
clause. Without getting too abstruse, the
best way to think about this is that all the rows in the window will be
ordered as indicated by ORDER BY
, and the window function
will consider only rows that range from the first row in the window up
to and including the current row in the window or partition. The classic
example uses the ROW_NUMBER
function to sequentially
number rows. In Example 7-32, we
demonstrate how to number our census tracts in alphabetical order. To
arrive at the row number, ROW_NUMBER
counts all rows up to
and including the current row based on the order dictated by the
ORDER BY
.
SELECT ROW_NUMBER() OVER (ORDER BY tract_name) As rnum, tract_name
FROM census.lu_tracts
ORDER BY rnum
LIMIT 4;
rnum | tract_name
-----+-------------------------------------------------
   1 | Census Tract 1, Suffolk County, Massachusetts
   2 | Census Tract 1001, Suffolk County, Massachusetts
   3 | Census Tract 1002, Suffolk County, Massachusetts
   4 | Census Tract 1003, Suffolk County, Massachusetts
In Example 7-32, we also have an
ORDER BY
for the entire query. Don’t get confused between
this and the ORDER BY
that’s specific to the window
function.
You can combine ORDER BY
with PARTITION
BY
, restarting the ordering for each partition. Example 7-33 returns to our example
of county codes.
SELECT tract_id, val,
    SUM(val) OVER (PARTITION BY left(tract_id,5) ORDER BY val) As sum_county_ordered
FROM census.facts
WHERE fact_type_id = 2
ORDER BY left(tract_id,5), val;
tract_id    | val     | sum_county_ordered
------------+---------+-------------------
25001014100 | 226.000 | 226.000
25001011700 | 971.000 | 1197.000
25001010208 | 984.000 | 2181.000
:
25003933200 | 564.000 | 564.000
25003934200 | 593.000 | 1157.000
25003931300 | 606.000 | 1763.000
:
The key observation to make in the output is how the sum changes
from row to row. The ORDER BY
clause means that the sum
will be taken only from the beginning of the partition to the current
row, giving you a running total, where the location of the current row
in the list is dictated by the ORDER BY
clause. For
instance, if your row is in the fifth row in the third partition, the
sum will cover only the first five rows in the third partition. We put
an ORDER BY left(tract_id,5), val
at the end of the query
so you can easily see the pattern, but keep in mind that the ORDER
BY
of the query is independent of the ORDER BY
in
each OVER
clause.
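The running-total behavior can be mimicked in Python: partition the rows, order each partition, and accumulate. The county codes and values below are an invented subset echoing the example output:

```python
# SUM(val) OVER (PARTITION BY county ORDER BY val) as a running total:
# within each partition, each row's sum covers rows from the start of
# the partition up to and including the current row, in ORDER BY order.
from itertools import groupby

rows = [("25001", 971.0), ("25001", 226.0),   # invented subset of the
        ("25003", 593.0), ("25003", 564.0)]   # census facts values

result = []
for county, grp in groupby(sorted(rows), key=lambda r: r[0]):
    running = 0.0
    for _, val in sorted(grp, key=lambda r: r[1]):
        running += val                        # only rows seen so far count
        result.append((county, val, running))
print(result)
```

Note how the accumulator resets at each new partition, which is exactly what restarting sums per county code looks like in the query output.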
You can explicitly control the rows under consideration by adding a RANGE or ROWS clause, such as ROWS BETWEEN CURRENT ROW AND 5 FOLLOWING.
PostgreSQL also supports window naming, which is useful if you
have the same window for each of your window columns. Example 7-34 demonstrates how to name windows
as well as how to use the LEAD
and
LAG
window functions to show a record value before and after for a
given partition.
SELECT *
FROM (
    SELECT
        ROW_NUMBER() OVER (wt) As rnum,
        substring(tract_id,1,5) As county_code,
        tract_id,
        LAG(tract_id,2) OVER wt As tract_2_before,
        LEAD(tract_id) OVER wt As tract_after
    FROM census.lu_tracts
    WINDOW wt AS (PARTITION BY substring(tract_id,1,5) ORDER BY tract_id)
) As x
WHERE rnum BETWEEN 2 and 3 AND county_code IN ('25007','25025')
ORDER BY county_code, rnum;
rnum | county_code | tract_id    | tract_2_before | tract_after
-----+-------------+-------------+----------------+------------
   2 | 25007       | 25007200200 |                | 25007200300
   3 | 25007       | 25007200300 | 25007200100    | 25007200400
   2 | 25025       | 25025000201 |                | 25025000202
   3 | 25025       | 25025000202 | 25025000100    | 25025000301
Both LEAD and LAG take an optional step argument that defines how many rows to skip forward or backward; the step can be positive or negative. LEAD and LAG return NULL when they try to retrieve rows outside the window partition, a possibility you always have to account for.
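The NULL-off-the-edge behavior is easy to picture as list indexing within one partition; a small Python sketch, with tract IDs taken from the example output:

```python
# LAG/LEAD as list lookups within one partition: stepping off either
# end of the partition yields None, the analog of SQL NULL.
def lag(partition, i, step=1):
    return partition[i - step] if i - step >= 0 else None

def lead(partition, i, step=1):
    return partition[i + step] if i + step < len(partition) else None

tracts = ["25007200100", "25007200200", "25007200300", "25007200400"]
print(lag(tracts, 1, 2))  # None: two rows before row 1 falls off the front
print(lag(tracts, 2, 2))  # '25007200100'
print(lead(tracts, 3))    # None: no row after the last one
```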
In PostgreSQL, any aggregate function you create can be used as a
window function. Other databases tend to limit window functions to using
built-in aggregates such as AVG
, SUM
,
MIN
, and MAX
.
Essentially, common table expressions (CTEs) allow you to define a query that can be reused in a larger query. CTEs act as temporary tables defined within the scope of the statement; they’re gone once the enclosing statement has finished executing.
There are three ways to use CTEs:
Basic CTE
This is your plain-vanilla CTE, used to make your SQL more readable or to encourage the planner to materialize a costly intermediate result for better performance.

Writable CTE
This is an extension of the basic CTE with UPDATE, INSERT, and DELETE commands. A common final step in the CTE is to return changed rows.

Recursive CTE
This puts an entirely new whirl on standard CTE: the rows returned by a recursive CTE vary during the execution of the query.

PostgreSQL allows you to have a CTE that is both writable and recursive.
The basic CTE looks like Example 7-35. The
WITH
keyword introduces the CTE.
WITH cte AS (
    SELECT
        tract_id,
        substring(tract_id,1,5) As county_code,
        COUNT(*) OVER (PARTITION BY substring(tract_id,1,5)) As cnt_tracts
    FROM census.lu_tracts
)
SELECT MAX(tract_id) As last_tract, county_code, cnt_tracts
FROM cte
WHERE cnt_tracts > 100
GROUP BY county_code, cnt_tracts;
cte
is the name of the CTE in Example 7-35, defined using a SELECT
statement to contain three columns: tract_id
,
county_code
, and cnt_tracts
. The main
SELECT
refers to the CTE.
You can stuff as many CTEs as you like, separated by commas, into
the WITH
clause, as shown in Example 7-36. The order of the CTEs matters in that CTEs
defined later can call CTEs defined earlier, but not vice versa.
WITH cte1 AS (
    SELECT
        tract_id,
        substring(tract_id,1,5) As county_code,
        COUNT(*) OVER (PARTITION BY substring(tract_id,1,5)) As cnt_tracts
    FROM census.lu_tracts
),
cte2 AS (
    SELECT MAX(tract_id) As last_tract, county_code, cnt_tracts
    FROM cte1
    WHERE cnt_tracts < 8
    GROUP BY county_code, cnt_tracts
)
SELECT c.last_tract, f.fact_type_id, f.val
FROM census.facts As f INNER JOIN cte2 c ON f.tract_id = c.last_tract;
The writable CTE extends the CTE to allow for update, delete, and insert statements. We’ll revisit our logs tables that we created in Example 6-3, adding another child table and populating it:
CREATE TABLE logs_2011_01_02 (
    PRIMARY KEY (log_id),
    CONSTRAINT chk CHECK (log_ts >= '2011-01-01' AND log_ts < '2011-03-01')
) INHERITS (logs_2011);
In Example 7-37, we move data from
our parent 2011 table to our new child Jan-Feb 2011 table. The
ONLY
keyword is described in “Restricting DELETE, UPDATE, and SELECT from Inherited
Tables”
and the RETURNING
keyword in “Returning Affected Records to the User”.
WITH t AS (
    DELETE FROM ONLY logs_2011
    WHERE log_ts < '2011-03-01'
    RETURNING *
)
INSERT INTO logs_2011_01_02
SELECT * FROM t;
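The effect of this statement: delete the matching rows from the parent, capture them via RETURNING, and insert the captured rows into the child, all in one statement. A Python sketch of the data movement, with invented log rows:

```python
# DELETE ... RETURNING * feeding an INSERT: rows migrate from the
# parent table to the child. Log rows here are invented sample data.
logs_2011 = [("2011-01-15", "a"), ("2011-02-20", "b"), ("2011-05-01", "c")]

t = [r for r in logs_2011 if r[0] < "2011-03-01"]           # RETURNING *
logs_2011 = [r for r in logs_2011 if r[0] >= "2011-03-01"]  # rows deleted
logs_2011_01_02 = list(t)                                   # INSERT ... SELECT * FROM t
print(logs_2011, logs_2011_01_02)
```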
The official documentation for PostgreSQL describes it best: “The optional RECURSIVE modifier changes CTE from a mere syntactic convenience into a feature that accomplishes things not otherwise possible in standard SQL.” A recursive CTE uses a self-referencing construct to build up its result. PostgreSQL recursive CTEs use UNION ALL to combine tables, repeatedly evaluating the recursive part and adding its rows until no new rows are produced.
To turn a basic CTE into a recursive one, add the RECURSIVE modifier after the WITH. A WITH RECURSIVE can contain a mix of recursive and nonrecursive table expressions. In most other databases, the RECURSIVE keyword is not necessary to denote recursion.
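The evaluation procedure behind WITH RECURSIVE can be sketched in Python: start from the nonrecursive seed, then repeatedly run the recursive term against the rows produced in the previous step, appending its output (UNION ALL) until a step yields nothing. The parent/child names below mirror the logs tables used later in this section:

```python
# Recursive CTE evaluation, sketched: seed rows, then repeat the
# recursive term over only the previous step's rows until it is empty.
children = {"logs": ["logs_2011"], "logs_2011": ["logs_2011_01_02"]}

result = ["logs"]      # seed: the nonrecursive term (tables with no parent)
working = ["logs"]     # rows produced by the previous step
while working:
    step = [parent + "->" + child          # recursive term: join the last
            for parent in working          # step's rows to their children
            for child in children.get(parent.split("->")[-1], [])]
    result.extend(step)                    # UNION ALL
    working = step
print(result)  # ['logs', 'logs->logs_2011', 'logs->logs_2011->logs_2011_01_02']
```

The loop terminates naturally once a table has no children, which is why acyclic hierarchies are safe; a cycle in the data would loop forever unless you guard against revisiting rows.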
A common use of recursive CTEs is to represent message threads and other tree-like structures. We have an example of this in Recursive CTE to Display Tree Structures.
In Example 7-38, we query the system catalog to list the cascading table relationships we have in our database.
WITH RECURSIVE tbls AS (
    SELECT c.oid As tableoid, n.nspname AS schemaname, c.relname AS tablename
    FROM
        pg_class c
        LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
        LEFT JOIN pg_tablespace t ON t.oid = c.reltablespace
        LEFT JOIN pg_inherits As th ON th.inhrelid = c.oid
    WHERE th.inhrelid IS NULL AND c.relkind = 'r'::"char" AND c.relhassubclass
    UNION ALL
    SELECT c.oid As tableoid, n.nspname AS schemaname,
        tbls.tablename || '->' || c.relname AS tablename
    FROM
        tbls
        INNER JOIN pg_inherits As th ON th.inhparent = tbls.tableoid
        INNER JOIN pg_class c ON th.inhrelid = c.oid
        LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
        LEFT JOIN pg_tablespace t ON t.oid = c.reltablespace
)
SELECT * FROM tbls ORDER BY tablename;
tableoid | schemaname | tablename
---------+------------+---------------------------------
 3152249 | public     | logs
 3152260 | public     | logs->logs_2011
 3152272 | public     | logs->logs_2011->logs_2011_01_02
Get a list of all tables that have child tables but no parent table.

This is the recursive part; it gets all children of the tables in tbls.

The names of the child tables start with the parental name.

Return parents and all child tables. Because we sort by table name, which prepends the parent name, all child tables follow their parents in the output.
LATERAL is a new ANSI SQL construction in version 9.3. Here’s the motivation behind it: when you join two tables or subqueries, the pair participating in the join are normally independent units that can’t read data from each other. For example, the following query generates an error because L.yr = 2011 refers to a column that isn’t visible on the righthand side of the join:
SELECT *
FROM census.facts L
    INNER JOIN (
        SELECT *
        FROM census.lu_fact_types
        WHERE category = CASE WHEN L.yr = 2011 THEN 'Housing' ELSE category END
    ) R
    ON L.fact_type_id = R.fact_type_id;
Now add the LATERAL
keyword, and
the error is gone:
SELECT *
FROM census.facts L
    INNER JOIN LATERAL (
        SELECT *
        FROM census.lu_fact_types
        WHERE category = CASE WHEN L.yr = 2011 THEN 'Housing' ELSE category END
    ) R
    ON L.fact_type_id = R.fact_type_id;
LATERAL
lets you share data in columns across two
tables in a FROM
clause. However, it
works only in one direction: the righthand side can draw from the lefthand
side, but not vice versa.
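One way to picture LATERAL is as a correlated nested loop: for each lefthand row, the righthand subquery is re-evaluated with that row's columns in scope. A Python sketch of the query above, using invented facts and fact-type rows:

```python
# LATERAL as a correlated nested loop: the inner (righthand) scan can
# see the current outer (lefthand) row. All rows are invented samples.
facts = [{"fact_type_id": 1, "yr": 2011}, {"fact_type_id": 2, "yr": 2010}]
fact_types = [{"fact_type_id": 1, "category": "Housing"},
              {"fact_type_id": 2, "category": "Population"}]

result = []
for L in facts:                      # lefthand side of the join
    for R in fact_types:             # righthand side, re-run per L row
        # CASE WHEN L.yr = 2011 THEN 'Housing' ELSE category END
        wanted = "Housing" if L["yr"] == 2011 else R["category"]
        if R["category"] == wanted and L["fact_type_id"] == R["fact_type_id"]:
            result.append((L["fact_type_id"], R["category"]))
print(result)  # [(1, 'Housing'), (2, 'Population')]
```

The one-directional rule falls out of the loop structure: the inner loop can read `L`, but the outer loop has no way to read `R`.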
There are situations when you should avail yourself of LATERAL
to avoid extremely convoluted syntax. In
Example 7-39, a column on the left
serves as a parameter in the generate_series
function on the
right:
CREATE TABLE interval_periods (i_type interval);
INSERT INTO interval_periods (i_type)
VALUES ('5 months'), ('132 days'), ('4862 hours');

SELECT i_type, dt
FROM interval_periods
    CROSS JOIN LATERAL
    generate_series('2012-01-01'::date, '2012-12-31'::date, i_type) AS dt
WHERE NOT (dt = '2012-01-01' AND i_type = '132 days'::interval);
i_type     | dt
-----------+-----------------------
5 mons     | 2012-01-01 00:00:00-05
5 mons     | 2012-06-01 00:00:00-04
5 mons     | 2012-11-01 00:00:00-04
132 days   | 2012-05-12 00:00:00-04
132 days   | 2012-09-21 00:00:00-04
4862:00:00 | 2012-01-01 00:00:00-05
4862:00:00 | 2012-07-21 15:00:00-04
Lateral is also helpful for using values from the lefthand side to
limit the number of rows returned from the righthand side. Example 7-40 uses LATERAL
to return, for each superuser who has used our site within the last 100
days, the last five logins and what they were up to. Tables used in this
example were created in “TYPE OF” and “Basic Table Creation”.
SELECT u.user_name, l.description, l.log_ts
FROM super_users AS u
    CROSS JOIN LATERAL (
        SELECT description, log_ts
        FROM logs
        WHERE log_ts > CURRENT_TIMESTAMP - interval '100 days'
            AND logs.user_name = u.user_name
        ORDER BY log_ts DESC
        LIMIT 5
    ) AS l;
Although you can achieve the same results by using window functions, lateral joins yield faster results with more succinct syntax.
You can use multiple lateral joins in your SQL and even chain them in sequence as you would when joining more than two subqueries. You can sometimes get away with omitting the LATERAL keyword; the query parser is smart enough to infer a lateral join if you have a correlated expression. But we advise that you always include the keyword for the sake of clarity. Also, you’ll get an error if you write your statement assuming the use of a lateral join but run it on a prelateral version of PostgreSQL. Without the keyword, PostgreSQL might end up performing a join with unintended results.
Other database products also offer lateral joins, although they
don’t abide by the ANSI moniker. In Oracle, you’d use a table pipeline
construct. In SQL Server, you’d use CROSS APPLY
or OUTER
APPLY
.
Introduced in version 9.4, the WITH ORDINALITY clause is an ANSI SQL standard construct. WITH ORDINALITY adds a sequential number column to a set-returning function result.
Although you can’t use WITH ORDINALITY
with tables
and subqueries, you can achieve the same result for those by using the
window function ROW_NUMBER.
You’ll find WITH ORDINALITY
often used with functions
like generate_series
, unnest
, and other
functions that expand out composite types and arrays. It can be used with
any set-returning function, including ones you create
yourself.
Example 7-41 demonstrates WITH
ORDINALITY
used in conjunction with the temporal variant of
the generate_series
function.
SELECT dt.*
FROM generate_series('2016-01-01'::date, '2016-12-31'::date, interval '1 month')
    WITH ORDINALITY As dt;
dt                     | ordinality
-----------------------+-----------
2016-01-01 00:00:00-05 |  1
2016-02-01 00:00:00-05 |  2
2016-03-01 00:00:00-05 |  3
2016-04-01 00:00:00-04 |  4
2016-05-01 00:00:00-04 |  5
2016-06-01 00:00:00-04 |  6
2016-07-01 00:00:00-04 |  7
2016-08-01 00:00:00-04 |  8
2016-09-01 00:00:00-04 |  9
2016-10-01 00:00:00-04 | 10
2016-11-01 00:00:00-04 | 11
2016-12-01 00:00:00-05 | 12
(12 rows)
WITH ORDINALITY always adds an additional column called ordinality at the end of the result, and it can appear only in the FROM clause of an SQL statement. You are free to rename the ordinality column.
You’ll often find WITH ORDINALITY
paired with the
LATERAL
construct. In Example 7-42 we repeat the LATERAL
in
Example 7-39, but add on a sequential
number to each set.
SELECT d.ord, i_type, d.dt
FROM interval_periods
    CROSS JOIN LATERAL
    generate_series('2012-01-01'::date, '2012-12-31'::date, i_type)
    WITH ORDINALITY AS d(dt, ord)
WHERE NOT (dt = '2012-01-01' AND i_type = '132 days'::interval);
ord | i_type     | dt
----+------------+-----------------------
  1 | 5 mons     | 2012-01-01 00:00:00-05
  2 | 5 mons     | 2012-06-01 00:00:00-04
  3 | 5 mons     | 2012-11-01 00:00:00-04
  2 | 132 days   | 2012-05-12 00:00:00-04
  3 | 132 days   | 2012-09-21 00:00:00-04
  1 | 4862:00:00 | 2012-01-01 00:00:00-05
  2 | 4862:00:00 | 2012-07-21 15:00:00-04
(7 rows)
In Example 7-42, WITH
ORDINALITY
gets applied to the result of the set-returning
function. It always gets applied before the WHERE
condition.
As a result, there is a gap in numbering in the final result (the number 1
is lacking for the 132 day interval), because the number was filtered out
by our WHERE
condition.
If we didn’t have the WHERE condition excluding the 2012-01-01, 132 days record, we would have 8 rows, with the 4th row being 1 | 132 days | 2012-01-01 00:00:00-04.
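In other words, numbering happens first and filtering second, which is why ordinality 1 disappears rather than the surviving rows being renumbered. In Python terms:

```python
# WITH ORDINALITY numbers rows before the WHERE clause filters them,
# so filtered-out rows leave gaps in the numbering.
dates = ["2012-01-01", "2012-05-12", "2012-09-21"]  # the 132-day series

numbered = list(enumerate(dates, start=1))          # ordinality assigned first
kept = [(ord_, dt) for ord_, dt in numbered if dt != "2012-01-01"]
print(kept)  # [(2, '2012-05-12'), (3, '2012-09-21')] -- ordinality 1 is gone
```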
If you’ve ever tried to create a summary report that includes both totals and subtotals, you’ll appreciate the capability to partition your data on the fly. Grouping sets let you do exactly that.
For our table of test scores, if we need to find both the overall average per student and the average per student by subject, we could write a query as shown in Example 7-43, taking advantage of grouping sets.
SELECT student, subject, AVG(score)::numeric(10,2)
FROM test_scores
WHERE student IN ('leo','regina')
GROUP BY GROUPING SETS ((student),(student, subject))
ORDER BY student, subject NULLS LAST;
student | subject   | avg
--------+-----------+------
leo     | algebra   | 82.00
leo     | calculus  | 65.50
leo     | chemistry | 75.50
leo     | physics   | 72.00
leo     | NULL      | 73.75
regina  | algebra   | 72.50
regina  | calculus  | 64.50
regina  | chemistry | 73.50
regina  | economics | 90.00
regina  | physics   | 84.00
regina  | NULL      | 75.44
(11 rows)
In a single query, Example 7-43 gives us both the average of each student across all subjects and his or her average in each subject.
We can even include a total for each subject across all students by having multiple grouping sets as shown in Example 7-44.
SELECT student, subject, AVG(score)::numeric(10,2)
FROM test_scores
WHERE student IN ('leo','regina')
GROUP BY GROUPING SETS ((student, subject),(student),(subject))
ORDER BY student NULLS LAST, subject NULLS LAST;
student | subject   | avg
--------+-----------+------
leo     | algebra   | 82.00
leo     | calculus  | 65.50
leo     | chemistry | 75.50
leo     | physics   | 72.00
leo     | NULL      | 73.75
regina  | algebra   | 72.50
regina  | calculus  | 64.50
regina  | chemistry | 73.50
regina  | economics | 90.00
regina  | physics   | 84.00
regina  | NULL      | 75.44
NULL    | algebra   | 77.25
NULL    | calculus  | 65.00
NULL    | chemistry | 74.50
NULL    | economics | 90.00
NULL    | physics   | 78.00
(16 rows)
What if we wanted to have total breakdowns for student, student plus
subject, and overall average? We could revise our query to add a universal
grouping set GROUPING SETS ((student),(student, subject),())
.
This is equivalent to the shorthand ROLLUP (student, subject)
. See Example 7-45.
SELECT student, subject, AVG(score)::numeric(10,2)
FROM test_scores
WHERE student IN ('leo','regina')
GROUP BY ROLLUP (student, subject)
ORDER BY student NULLS LAST, subject NULLS LAST;
student | subject   | avg
--------+-----------+------
leo     | algebra   | 82.00
leo     | calculus  | 65.50
leo     | chemistry | 75.50
leo     | physics   | 72.00
leo     | NULL      | 73.75
regina  | algebra   | 72.50
regina  | calculus  | 64.50
regina  | chemistry | 73.50
regina  | economics | 90.00
regina  | physics   | 84.00
regina  | NULL      | 75.44
NULL    | NULL      | 74.65
(12 rows)
If we reverse the order of columns in ROLLUP, we get the score for each student/subject pair, average for each subject, and overall average as shown in Example 7-46.
SELECT student, subject, AVG(score)::numeric(10,2)
FROM test_scores
WHERE student IN ('leo','regina')
GROUP BY ROLLUP (subject, student)
ORDER BY student NULLS LAST, subject NULLS LAST;
student | subject   | avg
--------+-----------+------
leo     | algebra   | 82.00
leo     | calculus  | 65.50
leo     | chemistry | 75.50
leo     | physics   | 72.00
regina  | algebra   | 72.50
regina  | calculus  | 64.50
regina  | chemistry | 73.50
regina  | economics | 90.00
regina  | physics   | 84.00
NULL    | algebra   | 77.25
NULL    | calculus  | 65.00
NULL    | chemistry | 74.50
NULL    | economics | 90.00
NULL    | physics   | 78.00
NULL    | NULL      | 74.65
(15 rows)
If we also wanted to include subtotals for just the subject and just
the student, we’d use GROUPING SETS ( (student), (student, subject),
(subject), () )
, or the shorthand CUBE (student, subject)
in Example 7-47.
SELECT student, subject, AVG(score)::numeric(10,2)
FROM test_scores
WHERE student IN ('leo','regina')
GROUP BY CUBE (student, subject)
ORDER BY student NULLS LAST, subject NULLS LAST;
student | subject   | avg
--------+-----------+------
leo     | algebra   | 82.00
leo     | calculus  | 65.50
leo     | chemistry | 75.50
leo     | physics   | 72.00
leo     | NULL      | 73.75
regina  | algebra   | 72.50
regina  | calculus  | 64.50
regina  | chemistry | 73.50
regina  | economics | 90.00
regina  | physics   | 84.00
regina  | NULL      | 75.44
NULL    | algebra   | 77.25
NULL    | calculus  | 65.00
NULL    | chemistry | 74.50
NULL    | economics | 90.00
NULL    | physics   | 78.00
NULL    | NULL      | 74.65
(17 rows)
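The ROLLUP and CUBE shorthands used in these examples are just expansions into lists of grouping sets, and the expansion rules are simple enough to sketch in Python:

```python
# ROLLUP(a,b) expands to grouping sets (a,b), (a), ();
# CUBE(a,b) expands to every subset of the columns, including ().
from itertools import chain, combinations

def rollup(cols):
    """Successive prefixes of the column list, longest first."""
    return [tuple(cols[:i]) for i in range(len(cols), -1, -1)]

def cube(cols):
    """All subsets of the column list, largest first."""
    return list(chain.from_iterable(
        combinations(cols, n) for n in range(len(cols), -1, -1)))

print(rollup(["student", "subject"]))
# [('student', 'subject'), ('student',), ()]
print(cube(["student", "subject"]))
# [('student', 'subject'), ('student',), ('subject',), ()]
```

This is why reversing the column order of ROLLUP changes which subtotals you get (the prefixes change), while CUBE is order-insensitive apart from row ordering.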