PostgreSQL supports the workhorse data types of any database: numerics, strings, dates, times, and booleans. But PostgreSQL sprints ahead by adding support for arrays, time zone−aware datetimes, time intervals, ranges, JSON, XML, and many more. If that’s not enough, you can invent custom types. In this chapter, we don’t intend to cover every data type. For that, there’s always the manual. We showcase data types that are unique to PostgreSQL and nuances in how PostgreSQL handles common data types.
No data type would be useful without a cast of supporting functions and operators. And PostgreSQL has plenty of them. We’ll cover the more popular ones in this chapter.
When we use the term function, we’re talking about something that’s of the form
f(x)
. When we use the term operator,
we’re talking about something that’s symbolic and either unary
(having one argument) or binary (having two arguments) such as
+
, -
, *
, or /
. When
using operators, keep in mind that the same symbol can take on a different
meaning when applied to different data types. For example, the plus sign
means adding for numerics but unioning for ranges.
You will find your everyday integers, decimals, and floating-point numbers in PostgreSQL. Of the numeric types, we want to discuss serial data types and a nifty function to quickly generate arithmetic series of integers.
Serial and its bigger sibling, bigserial, are auto-incrementing
integers often used as primary keys of tables in which a natural key is not
apparent. This data type goes by different names in different database
products, with autonumber being the most common
alternative moniker. When you create a table and specify a column as
serial, PostgreSQL first creates an integer column and then creates a
sequence object named
located in the same schema as the table. It then sets the default of the
new integer column to read its value from the sequence. If you drop the column, PostgreSQL also drops the companion
sequence object.table_name
_column_name
_seq
In PostgreSQL, the sequence type is a database asset in its own
right. You can inspect and edit the sequences using SQL with the ALTER SEQUENCE
command or using PGAdmin.
You can set the current value, boundary values (both the upper and lower
bounds), and even how many numbers to increment each time. Though
decrementing is rare, you can do it by setting the increment value to a
negative number. Because sequences are independent database assets, you
can create them separately from a table using the CREATE
SEQUENCE
command, and you can use the same sequence
across multiple tables. The cross-table sharing of the same sequence
comes in handy when you’re assigning a universal key in your
database.
To use an extant sequence for subsequent tables, create a new
column in the table as integer or bigint—not as serial—then set the default value
of the column using the nextval(
function as shown in Example 5-1.sequence_name
)
CREATE
SEQUENCE
s
START
1
;
CREATE
TABLE
stuff
(
id
bigint
DEFAULT
nextval
(
's'
)
PRIMARY
KEY
,
name
text
);
If you rename a table that has a serial based on a sequence, PostgreSQL will not automatically rename the sequence object. To avoid confusion, you should rename the sequence object.
PostgreSQL has a nifty function called generate_series
not
found in other database products. The function comes in two forms. One
is a numeric version that creates a sequence of integers incremented by
some value and one that creates a sequence of dates or timestamps
incremented by some time interval. What makes
generate_series
so convenient is that it allows you to
effectively mimic a for loop in SQL. Example 5-2 demonstrates the numeric
version. Example 5-13 demonstrates
the temporal version.
Example 5-2 uses integers with an optional step parameter.
SELECT
x
FROM
generate_series
(
1
,
51
,
13
)
As
x
;
x ---- 1 14 27 40
The default step is 1. As demonstrated in Example 5-2, you can pass in an optional step argument to specify how many steps to skip for each successive element. The end value will never exceed our prescribed range, so although our range ends at 51, our last number is 40 because adding another 13 to our 40 busts the upper bound.
There are three primitive textual types in PostgreSQL: character (abbreviable as char),
character varying
(abbreviable as varchar), and text.
Use char only when the values stored are fixed length, such as postal codes, phone numbers, and Social Security numbers in the US. If your value is under the length specified, PostgreSQL automatically adds spaces to the end. When compared with varchar or text, the right-padding takes up more superfluous storage, but you get the assurance of an invariable length. There is absolutely no speed performance benefit of using char over varchar or text and char will always take up more disk space. Use character varying to store strings with varying length. When defining varchar columns, you should specify the maximum length of a varchar. Text is the most generic of the textual data types. With text, you cannot specify a maximum length.
The max length modifier for varchar is optional. Without it, varchar
behaves almost identically to text. Subtle differences do surface when
connecting to PostgreSQL via drivers. For instance, the ODBC driver cannot
sort text columns. Both varchar and text have a maximum storage of 1G for
each value—that’s a lot! Behind the scenes, any value larger than what can
fit in a record page gets pushed to TOAST
.
Some folks advocate abandoning varchar and always using text. Rather than waste space arguing about it here, read the debate at In Defense of Varchar(X).
Often, for cross-system compatibility, you want to remove case sensitivity from your character types. To do this, you need to override comparison operators that take case into consideration. Overriding operators is easier for varchar than it is for text. We demonstrate an example in Using MS Access with PostgreSQL, where we show how to make varchar behave without case sensitivity and still be able to use an index.
Common string manipulations are padding (lpad
, rpad
), trimming
whitespace (rtrim
, ltrim
,
trim
, btrim
), extracting substrings (substring
), and concatenating (||
). Example 5-3 demonstrates padding, and Example 5-4 demonstrates trimming.
SELECT
lpad
(
'ab'
,
4
,
'0'
)
As
ab_lpad
,
rpad
(
'ab'
,
4
,
'0'
)
As
ab_rpad
,
lpad
(
'abcde'
,
4
,
'0'
)
As
ab_lpad_trunc
;
ab_lpad | ab_rpad | ab_lpad_trunc --------+---------+--------------- 00ab | ab00 | abcd
By default, trim functions remove spaces, but you can pass in an optional argument indicating other characters to trim.
SELECT
a
As
a_before
,
trim
(
a
)
As
a_trim
,
rtrim
(
a
)
As
a_rt
,
i
As
i_before
,
ltrim
(
i
,
'0'
)
As
i_lt_0
,
rtrim
(
i
,
'0'
)
As
i_rt_0
,
trim
(
i
,
'0'
)
As
i_t_0
FROM
(
SELECT
repeat
(
' '
,
4
)
||
i
||
repeat
(
' '
,
4
)
As
a
,
'0'
||
i
As
i
FROM
generate_series
(
0
,
200
,
50
)
As
i
)
As
x
;
a_before | a_trim | a_rt | i_before | i_lt_0 | i_rt_0 | i_t_0 ---------+--------+------+----------+--------+--------+------ 0 | 0 | 0 | 00 | | | 50 | 50 | 50 | 050 | 50 | 05 | 5 100 | 100 | 100 | 0100 | 100 | 01 | 1 150 | 150 | 150 | 0150 | 150 | 015 | 15 200 | 200 | 200 | 0200 | 200 | 02 | 2
A helpful function for aggregating strings is the string_agg function, which we demonstrate in Examples 3-14 and 5-26.
There are a couple of useful functions in PostgreSQL for tearing strings apart.
The split_part
function is useful for extracting an element from a delimited
string, as shown in Example 5-5. Here, we
select the second item in a string of items delimited by periods.
SELECT
split_part
(
'abc.123.z45'
,
'.'
,
2
)
As
x
;
x --- 123
The string_to_array
function is useful for creating an array of elements from a delimited
string. By combining string_to_array
with the
unnest
function, you can expand the returned array into a set of rows, as
shown in Example 5-6.
SELECT
unnest
(
string_to_array
(
'abc.123.z45'
,
'.'
))
As
x
;
x --- abc 123 z45
PostgreSQL’s regular expression support is downright fantastic. You can return matches as tables or arrays and choreograph replaces and updates. Back-referencing and other fairly advanced search patterns are also supported. In this section, we’ll provide a small sampling. For more information, see Pattern Matching and String Functions.
Example 5-7 shows you how to format phone numbers stored simply as contiguous digits.
SELECT
regexp_replace
(
'6197306254'
,
'([0-9]{3})([0-9]{3})([0-9]{4})'
,
E
'(\1) \2-\3'
)
As
x
;
x -------------- (619) 730-6254
The \1
, \2
, etc., refer to elements in
our pattern expression. We use a backslash () to escape the parentheses.
The
E'
construct is PostgreSQL syntax for denoting that the
string to follow should be taken literally.
Suppose some field contains text with embedded phone numbers; Example 5-8 shows how to extract the phone numbers and turn them into rows all in one step.
SELECT
unnest
(
regexp_matches
(
'Cell (619) 852-5083. Work (619)123-4567 , Casa 619-730-6254. Bésame mucho.'
,
E
'[(]{0,1}[0-9]{3}[)-.]{0,1}[\s]{0,1}[0-9]{3}[-.]{0,1}[0-9]{4}'
,
'g'
)
)
As
x
;
x -------------- (619) 852-5083 (619)123-4567 619-730-6254 (3 rows)
The matching rules for Example 5-8 are:
[(]{0,1}
: starts with zero or one open
parenthesis.
[0-9]{3}
: followed by three digits.
[)-.]{0,1}
: followed by zero or one closed
parenthesis, hyphen, or period.
[\s]+
: followed by zero or more spaces.
[0-9]{4}
: followed by four digits.
regexp_matches
returns a string array consisting of matches of a regular
expression. The last input to our function is the flags
parameter. We set this to g
, which stands for
global and returns all matches of a regular
expression as separate elements. If you leave out this
flags
parameter, then your array will only contain the
first match. The flags
parameter can consist of more
than one flag. For example, if you have letters in your regular
expression and text and you want to make the check case insensitive
and global, you would use two flags, gi
. In addition to
the global flag, other allowed flags are listed in POSIX EMBEDDED OPTIONS.
There are many ways to compose the same regular expression. For
instance, \d
is shorthand for [0-9]
. But
given the few characters you’d save, we prefer the more descriptive
longhand.
If you only care about the first match, you can utilize the
substring
function, which will return the first matching
value as shown in Example 5-9.
SELECT
substring
(
'Cell (619) 852-5083. Work (619)123-4567 , Casa 619-730-6254. Bésame mucho.'
from
E
'[(]{0,1}[0-9]{3}[)-.]{0,1}[\s]{0,1}[0-9]{3}[-.]{0,1}[0-9]{4}'
)
As
x
;
x ---------------- (619) 852-5083 (1 row)
In addition to the wealth of regular expression functions, you can
use regular expressions with the SIMILAR TO
(~
) operators. The following example returns all description fields with
embedded phone numbers:
SELECT description
FROM mytable
WHERE description ~
E'[(]{0,1}[0-9]{3}[)-.]{0,1}[\s]{0,1}[0-9]{3}[-.]{0,1}[0-9]{4}';
PostgreSQL support for temporal data is second to none. In addition to the usual dates and
times types, PostgreSQL supports time zones, enabling the automatic
handling of daylight saving time (DST) conversions by region. Specialized data
types such as interval
offer datetime arithmetic.
PostgreSQL also understands infinity and negative infinity, relieving us
from having to create conventions that we’ll surely forget. Range types provide support for
temporal ranges with a slew of companion operators,
functions, and indexes. We cover range types in “Range Types”.
At last count, PostgreSQL has nine temporal data types.
Understanding their distinctions is important in ensuring that you choose
the right data type for the job. All the types except range
abide by ANSI SQL standards. Other leading
database products support some, but not all, of these data types. Oracle
has the most varieties of temporal types; SQL Server ranks second; and
MySQL comes in last.
PostgreSQL temporal types vary in a number of ways to handle different situations. If a type is time zone−aware, the time changes if you change your server’s time zone. The types are:
date
Stores the month, day, and year, with no time zone awareness and no concept of hours, minutes, or seconds.
time
(aka time without time
zone
)Stores hours, minutes, and seconds with no awareness of time zone or calendar dates.
timestamp
(aka timestamp without time
zone
)Stores both calendar dates and time (hours, minutes, seconds) but does not care about the time zone.
timestamptz
(aka timestamp with time
zone
)A time zone−aware date and time data type. Internally,
timestamptz
is stored in Coordinated Universal Time
(UTC), but its display defaults to the time zone of the server,
the service config, the database, the user, or the session. Yes, you
can observe different time zones at different levels. If you input a
timestamp with no time zone and cast it to one with the time zone,
PostgreSQL assumes the default time zone in effect. If you don’t set
your time zone in postgresql.conf, the server’s default
takes effect. This means that if you change your server’s time zone,
you’ll see all the displayed times change after the PostgreSQL
server restarts.
timetz
(aka time with time
zone
)The lesser-used sister of timestamptz
. It is time zone−aware but
does not store the date. It always assumes DST of the current date
and time. Some programming languages with no concept of time without
date might map timetz
to a timestamp with some
arbitrary date such as Unix Epoch 1970, resulting in year 1970 being
assumed.
interval
A duration of time in hours, days, months, minutes, and others. It comes in handy for datetime arithmetic. For example, if the world is supposed to end in exactly 666 days from now, all you have to do is add an interval of 666 days to the current time to get the exact moment (and plan accordingly).
tsrange
Allows you to define opened and closed ranges of timestamp with no timezone
. The type
consists of two timestamps and opened/closed range qualifiers. For
example, '[2012-01-01 14:00, 2012-01-01
15:00)'::tsrange
defines a period starting at 14:00 but
ending before 15:00. Refer to Range Types for details.
tstzrange
Allows you to define opened and closed ranges of timestamp with timezone
.
daterange
A common misconception with PostgreSQL time zone−aware data types is that PostgreSQL
records an extra time marker with the datetime value itself. This is
incorrect. If you save 2012-2-14
18:08:00-8
(-8 being the Pacific offset from UTC), PostgreSQL
internally thinks like this:
Calculate the UTC time for 2012-02-14 18:08:00-8. This is 2012-02-15 04:08:00-0.
Store the value 2012-02-15
04:08:00
.
When you call the data back for display, PostgreSQL internally works like this:
Start with the requested time zone, defaulting to the server time zone if none is requested.
Compute the offset for time zone for this UTC time (-5 for
America/New_York
).
Determine the datetime with the offset (2012-02-15 16:08:00 with a -5 offset becomes 2012-02-15 21:08:00).
Display the result (2012-02-15
21:08:00-5
).
So PostgreSQL doesn’t store the time zone, but uses it only to convert the datetime to UTC before storage. After that, the time zone information is discarded. When PostgreSQL displays datetime, it does so in the default time zone dictated by the session, user, database, or server, in that order. If you use time zone−aware data types, you should consider the consequence of a server move from one time zone to another. Suppose you based a server in New York City and subsequently restored the database in Los Angeles. All timestamps with time zone fields could suddenly display in Pacific time. This is fine as long as you anticipate this behavior.
Here’s an example of how something can go wrong. Suppose that
McDonald’s had its server on the East Coast and the opening time for
stores is stored as timetz
. A new McDonald’s opens up in
San Francisco. The new franchisee phones McDonald’s headquarters to add
its store to the master directory with an opening time of 7 a.m. The
data entry dude entered the information as he is told: 7 a.m. The East
Coast PostgreSQL server interprets this to mean 7 a.m. Eastern, and now
early risers in San Francisco are lining up at the door wondering why
they can’t get their McBreakfasts at 4 a.m. Being hungry is one thing,
but we can imagine many situations in which confusion over a difference
of three hours could mean life or death.
Given the pitfalls, why would anyone want to use time zone−aware data types? First, it does spare you from having to do time zone conversions manually. For example, if a flight leaves Boston at 8 a.m. and arrives in Los Angeles at 11 a.m., and your server is in Europe, you don’t want to have to figure out the offset for each time manually. You could just enter the data with the Boston and Los Angeles local times. There’s another convincing reason to use time zone−aware data types: the automatic handling of DST. With countries deviating more and more from one another in DST schedules, manually keeping track of DST changes for a globally used database would require a dedicated programmer who does nothing but keep up-to-date with the latest DST schedules and map them to geographic enclaves.
Here’s an interesting example: a traveling salesperson catches a flight home from San Francisco to nearby Oakland. When she boards the plane, the clock at the terminal reads 2012-03-11 1:50 a.m. When she lands, the clock in the terminal reads 2012-03-11 3:10 a.m. How long was the flight? The key to the solution is that the change to DST occurred during the flight—the clocks sprang forward. With time zone−aware timestamps, you get 20 minutes, to which is a plausible answer for a short flight across the Bay. We get the wrong answer if we don’t use time zone−aware timestamps:
SELECT
'2012-03-11 3:10 AM America/Los_Angeles'
::
timestamptz
-
'2012-03-11 1:50 AM America/Los_Angeles'
::
timestamptz
;
gives you 20 minutes, whereas:
SELECT
'2012-03-11 3:10 AM'
::
timestamp
-
'2012-03-11 1:50 AM'
::
timestamp
;
gives you 1 hour and 20 minutes.
Let’s drive the point home with more examples, using a Boston server. For Example 5-10, I input my time in Los Angeles local time, but because my server is in Boston, I get a time returned in Boston local time. Note that it does give me the offset but that is merely display information. The timestamp is internally stored in UTC.
SELECT
'2012-02-28 10:00 PM America/Los_Angeles'
::
timestamptz
;
2012-02-29 01:00:00-05
In Example 5-11, we are getting back a timestamp without time zone. So the answer you get when you run this same query will be the same as mine, regardless of where in the world you are.
SELECT
'2012-02-28 10:00 PM America/Los_Angeles'
::
timestamptz
AT
TIME
ZONE
'Europe/Paris'
;
2012-02-29 07:00:00
The query is asking: what time is it in Paris if it’s 2012-02-28 10:00 p.m. in Los Angeles? Note the absence of the UTC offset in the result. Also, notice how you can specify a time zone with its official name rather than just an offset. Visit Wikipedia for a list of official time zone names.
The inclusion of a temporal interval data type greatly eases date and time arithmetic in PostgreSQL. Without it, we’d have to create another family of functions or use a nesting of functions as many other databases do. With intervals, we can add and subtract timestamp data simply by using the arithmetic operators we’re intimately familiar with. The following examples demonstrate operators and functions used with date and time data types.
The addition operator (+
) adds an interval to
a timestamp:
SELECT
'2012-02-10 11:00 PM'
::
timestamp
+
interval
'1 hour'
;
2012-02-11 00:00:00
You can also add intervals:
SELECT
'23 hours 20 minutes'
::
interval
+
'1 hour'
::
interval
;
24:20:00
The subtraction operator (-
) subtracts an interval from a temporal type:
SELECT
'2012-02-10 11:00 PM'
::
timestamptz
-
interval
'1 hour'
;
2012-02-10 22:00:00-05
OVERLAPS
, demonstrated in Example 5-12, returns true
if two temporal ranges overlap. This is
an ANSI SQL predicate equivalent to the overlaps
function. OVERLAPS
takes four parameters, the first pair constituting one range and the
last pair constituting the other range. An overlap considers the time
periods to be half open, meaning that the start time is included but the
end time is outside the range. This is slightly different behavior from
the common BETWEEN
predicate, which considers both start and end to be included. This
quirk won’t make a difference unless one of your ranges is a fixed point
in time (a period for which start and end are identical). Watch out for
this if you’re an avid user of the OVERLAPS
function.
SELECT
(
'2012-10-25 10:00 AM'
::
timestamp
,
'2012-10-25 2:00 PM'
::
timestamp
)
OVERLAPS
(
'2012-10-25 11:00 AM'
::
timestamp
,
'2012-10-26 2:00 PM'
::
timestamp
)
AS
x
,
(
'2012-10-25'
::
date
,
'2012-10-26'
::
date
)
OVERLAPS
(
'2012-10-26'
::
date
,
'2012-10-27'
::
date
)
As
y
;
x |y ---+--- t |f
In addition to operators and predicates, PostgreSQL comes with functions supporting temporal types. A full listing can be found at Datetime Functions and Operators. We’ll demonstrate a sampling here.
Once again, we start with the versatile
generate_series
function. You can use this function with temporal types and interval
steps.
As you can see in Example 5-13, we can express dates in our local datetime format or the more global ISO yyyy-mm-dd format. PostgreSQL automatically interprets differing input formats. To be safe, we tend to stick with entering dates in ISO, because date formats vary from culture to culture, server to server, or even database to database.
SELECT
(
dt
-
interval
'1 day'
)::
date
As
eom
FROM
generate_series
(
'2/1/2012'
,
'6/30/2012'
,
interval
'1 month'
)
As
dt
;
eom ------------ 2012-01-31 2012-02-29 2012-03-31 2012-04-30 2012-05-31
Another popular activity is to extract or format parts of a
datetime value. Here, the functions date_part
and to_char
fit the
bill. Example 5-14 also
drives home the behavior of DST for a time zone−aware data type. We
intentionally chose a period that crosses a daylight saving switchover
in US/East
. Because the clock springs
forward at 2 a.m., the final row of the table reflects the new
time.
SELECT
dt
,
date_part
(
'hour'
,
dt
)
As
hr
,
to_char
(
dt
,
'HH12:MI AM'
)
As
mn
FROM
generate_series
(
'2012-03-11 12:30 AM'
,
'2012-03-11 3:00 AM'
,
interval
'15 minutes'
)
As
dt
;
dt | hr | mn -----------------------+----+---------- 2012-03-11 00:30:00-05 | 0 | 12:30 AM 2012-03-11 00:45:00-05 | 0 | 12:45 AM 2012-03-11 01:00:00-05 | 1 | 01:00 AM 2012-03-11 01:15:00-05 | 1 | 01:15 AM 2012-03-11 01:30:00-05 | 1 | 01:30 AM 2012-03-11 01:45:00-05 | 1 | 01:45 AM 2012-03-11 03:00:00-04 | 3 | 03:00 AM
By default, generate_series
assumes
timestamptz
if you don’t explicitly cast values to timestamp
.
Arrays play an important role in PostgreSQL. They are particularly useful in building
aggregate functions, forming IN
and ANY
clauses,
and holding intermediary values for morphing to other data types. In
PostgreSQL, every data type has a companion array type. If you define your
own data type, PostgreSQL creates a corresponding array type in the
background for you. For example, integer
has an integer array
type integer[]
, character
has a character array
type character[]
, and so forth. We’ll show you some useful
functions to construct arrays short of typing them in manually. We will
then point out some handy functions for array manipulations. You can get
the complete listing of array functions and operators in the Official
Manual: Array Functions and Operators.
The most rudimentary way to create an array is to type the elements:
SELECT
ARRAY
[
2001
,
2002
,
2003
]
As
yrs
;
If the elements of your array can be extracted from a query, you
can use the more sophisticated constructor function, array()
:
SELECT
array
(
SELECT
DISTINCT
date_part
(
'year'
,
log_ts
)
FROM
logs
ORDER
BY
date_part
(
'year'
,
log_ts
)
);
Although the array
function has to be used with a
query returning a single column, you can specify a composite type as the
output, thereby achieving multicolumn results. We demonstrate this in
“Custom and Composite Data Types”.
You can cast a string representation of an array to an array with syntax of the form:
SELECT
'{Alex,Sonia}'
::
text
[]
As
name
,
'{46,43}'
::
smallint
[]
As
age
;
name | age -------------+-------- {Alex,Sonia} | {46,43}
You can convert delimited strings to an array with the string_to_array
function, as demonstrated
in Example 5-15.
SELECT
string_to_array
(
'CA.MA.TX'
,
'.'
)
As
estados
;
estados ---------- {CA,MA,TX} (1 row)
array_agg
is an aggregate function that can take a set of any data type
and convert it to an array, as demonstrated in Example 5-16.
SELECT
array_agg
(
log_ts
ORDER
BY
log_ts
)
As
x
FROM
logs
WHERE
log_ts
BETWEEN
'2011-01-01'
::
timestamptz
AND
'2011-01-15'
::
timestamptz
;
x ------------------------------------------ {'2011-01-01', '2011-01-13', '2011-01-14'}
PostgreSQL 9.5 introduced array_agg
function support
for arrays. In prior versions if you wanted to aggregate rows of arrays
with array_agg, you’d get an error. array_agg support for arrays makes
it much easier to build multidimensional arrays from one-dimensional
arrays, as shown in Example 5-17.
SELECT
array_agg
(
f
.
t
)
FROM
(
VALUES
(
'{Alex,Sonia}'
::
text
[]),
(
'{46,43}'
::
text
[]
)
)
As
f
(
t
);
array_agg ---------------------- {{Alex,Sonia},{46,43}} (1 row)
In order to aggregate arrays, they must be of the same data type and the same dimension. To force that in Example 5-17, we cast the ages to text. We also have the same number of items in the arrays being aggregated: two people and two ages. Arrays with the same number of elements are called balanced arrays.
A common function used with arrays is unnest
, which allows you to expand
the elements of an array into a set of rows, as demonstrated in Example 5-18.
SELECT
unnest
(
'{XOX,OXO,XOX}'
::
char
(
3
)[])
As
tic_tac_toe
;
tic_tac_toe --- XOX OXO XOX
Although you can add multiple unnests to a single
SELECT
, if the number of resultant rows from each array is
not balanced, you may get some head-scratching results.
A balanced unnest, as shown in Example 5-19, yields three rows.
SELECT
unnest
(
'{three,blind,mice}'
::
text
[])
As
t
,
unnest
(
'{1,2,3}'
::
smallint
[])
As
i
;
t |i ------+- three |1 blind |2 mice |3
If you remove an element of one array so that you don’t have an equal number of elements in both, you get the result shown in Example 5-20.
SELECT
unnest
(
'{blind,mouse}'
::
varchar
[])
AS
v
,
unnest
(
'{1,2,3}'
::
smallint
[])
AS
i
;
v |i ------+- blind |1 mouse |2 blind |3 mouse |1 blind |2 mouse |3
Version 9.4 introduced a multiargument unnest
function that puts in null placeholders where the arrays are not
balanced. The main drawback with the new unnest
is that it can appear only in the
FROM
clause. Example 5-21 revisits our unbalanced
arrays using the version 9.4 construct.
SELECT
*
FROM
unnest
(
'{blind,mouse}'
::
text
[],
'{1,2,3}'
::
int
[])
AS
f
(
t
,
i
);
t | i -------+--- blind | 1 mouse | 2 <NULL> | 3
PostgreSQL also supports array slicing using the
start
:end
syntax.
It returns another array that is a subarray of the original. For
example, to return new arrays that just contain elements 2 through 4 of
each original array, type:
SELECT
fact_subcats
[
2
:
4
]
FROM
census
.
lu_fact_types
;
To glue two arrays together end to end, use the concatenation operator ||
:
SELECT
fact_subcats
[
1
:
2
]
||
fact_subcats
[
3
:
4
]
FROM
census
.
lu_fact_types
;
You can also add additional elements to an existing array as follows:
SELECT
'{1,2,3}'
::
integer
[]
||
4
||
5
;
The result is {1,2,3,4,5}
.
Elements in arrays are most commonly referenced using the index of the element.
PostgreSQL array indexes start at 1. If you try to access an element
above the upper bound, you won’t get an error—only NULL
will be returned. The next example grabs
the first and last element of our array column:
SELECT
fact_subcats
[
1
]
AS
primero
,
fact_subcats
[
array_upper
(
fact_subcats
,
1
)]
As
segundo
FROM
census
.
lu_fact_types
;
We used the array_upper
function to get the upper bound of the array. The second required
parameter of the function indicates the dimension. In our case, our
array is one-dimensional, but PostgreSQL does support multidimensional
arrays.
PostgreSQL has several operators for working with array data. We already saw the
concatenation operator (||
) for combining multiple arrays
into one or adding an element to an array in “Array Slicing and Splicing”. Arrays also support the
following comparison operators: =
, <>
,
<
, >
, @>
,
<@
, and &&
. These operators require
both sides of the operator to be arrays of the same array data type. If
you have a GiST or GIN index on your array column, the comparison
operators can utilize them.
The overlap operator (&&
) returns
true
if two arrays have any elements in common. Example 5-22 will list all records in our table
where the fact_subcats contains elements
OCCUPANCY STATUS
or For rent
.
SELECT
fact_subcats
FROM
census
.
lu_fact_types
WHERE
fact_subcats
&&
'{OCCUPANCY STATUS,For rent}'
::
varchar
[];
fact_subcats ----------------------------------------------------------- {S01,"OCCUPANCY STATUS","Total housing units"...} {S02,"OCCUPANCY STATUS","Total housing units"...} {S03,"OCCUPANCY STATUS","Total housing units"...} {S10,"VACANCY STATUS","Vacant housing units","For rent"...} (4 rows)
The equality operator (=
) returns true
only
if elements in all the arrays are equal and in the same order. If you
don’t care about order of elements, and just need to know whether all
the elements in one array appear as a subset of the other array, use the
containment operators (@>
, <@
). Example 5-23 demonstrates the difference between the contains (@>
) and
contained by (@<
) operators.
SELECT
'{1,2,3}'
::
int
[]
@>
'{3,2}'
::
int
[]
AS
contains
;
contains -------- t (1 row)
SELECT
'{1,2,3}'
::
int
[]
<@
'{3,2}'
::
int
[]
AS
contained_by
;
contained_by ------------ f (1 row)
Range data types represent data with a beginning and an end. PostgreSQL also rolled out many operators and functions to identify overlapping ranges, check to see whether a value falls inside the range, and combine adjacent smaller ranges into larger ranges. Prior to range types, we had to kludge our own functions. These often were clumsy and slow, and didn’t always produce the expected results. We’ve been so happy with ranges that we’ve converted all of our temporal tables to use them where possible. We hope you share our joy.
Range types replace the need to use two separate fields to represent
ranges. Suppose we want all integers between −2 and 2, but not including
2. The range representation would be [-2,2)
. The square
bracket indicates a range that is closed on that end, whereas a
parenthesis indicates a range that is open on that end. Thus,
[-2,2)
includes exactly four integers: −2, −1, 0, 1.
Similarly:
The range (-2,2]
includes four integers: -1,
0, 1, 2
.
The range (-2,2)
includes three integers: -1,
0, 1
.
The range [-2,2]
includes five integers: -2,
-1, 0, 1, 2
.
PostgreSQL makes a distinction between discrete and continuous ranges. A range of integers or dates is discrete because you can enumerate each value within the range. Think of dots on a number line. A range of numerics or timestamps is continuous, because an infinite number of values lies between the end points.
A discrete range has multiple representations. Our earlier example
of [-2,2)
can be represented in the following ways and
still include the same number of values in the range: [-2,1],
(-3,1], (-3,2), [-2,2)
. Of these four representations, the one
with [)
is considered the canonical form. There’s nothing
magical about closed-open ranges except that if everyone agrees to using
that representation for discrete ranges, we can easily compare among
many ranges without having to worry first about converting open to close
or vice versa. PostgreSQL canonicalizes all discrete ranges, for both
storage and display. So if you enter a date range as
(2014-1-5,2014-2-1]
, PostgreSQL rewrites it as
[2014-01-06,2014-02-02)
.
PostgreSQL comes with six built-in range types for numbers and datetimes:
int4range
, int8range
A range of integers. Integer ranges are discrete and subject to canonicalization.
numrange
A continuous range of decimals, floating-point numbers, or double-precision numbers.
daterange
A discrete date range of calendar dates without time zone awareness.
tsrange
, tstzrange
A continuous date and time (timestamp) range allowing for fractional
seconds. tstrange
is not time zone−aware;
tstzrange
is time zone−aware.
For number-like ranges, if either the start point or the end point
is left blank, PostgreSQL replaces it with a null. For practicality, you
can interpret the null to represent either -infinity
on the
left or infinity
on the right. In actuality, you’re bound
by the smallest and largest values for the particular data type. So a
int4range
of (,)
would
be [-2147483648,2147483647)
.
For temporal ranges, -infinity
and
infinity
are valid upper and lower bounds.
In addition to the built-in range types, you can create your own range types. When you do, you can set the range to be either discrete or continuous.
A range, regardless of type, is always comprised of two elements of the same type with the bounding condition denoted by brackets or parentheses, as shown in Example 5-24.
SELECT
'[2013-01-05,2013-08-13]'
:
:
daterange
;
SELECT
'(2013-01-05,2013-08-13]'
:
:
daterange
;
SELECT
'(0,)'
:
:
int8range
;
SELECT
'(2013-01-05 10:00,2013-08-13 14:00]'
:
:
tsrange
;
[2013-01-05,2013-08-14) [2013-01-06,2013-08-14) [1,) ("2013-01-05 10:00:00","2013-08-13 14:00:00"]
A date range between 2013-01-05 and 2013-08-13 inclusive. Note the canonicalization on the upper bound.
A date range greater than 2013-01-05 and less than or equal to 2013-08-13. Notice the canonicalization.
All integers greater than 0. Note the canonicalization.
A timestamp greater than 2013-01-05 10:00 AM and less than or equal to 2013-08-13 2 PM.
Datetimes in PostgreSQL can take on the values of
-infinity
and infinity
. For uniformity and
in keeping with convention, we suggest that you always use [ for the
former and ) for the latter as in [-infinity,
infinity)
.
Ranges can also be defined using range constructor functions, which go by the same name as the range and can take two or three arguments. Here’s an example:
SELECT
daterange
(
'2013-01-05'
,
'infinity'
,
'[]'
);
The third argument denotes the bound. If omitted, the open-close
[)
convention is used by default. We suggest that you
always include the third element for clarity.
Temporal ranges are popular. Suppose you have an employment
table that
stores employment history. Instead of creating separate columns for
start and end dates, you can design a table as shown in Example 5-25. In the example, we added an
index to the period
column to speed up queries using our
range column.
CREATE
TABLE
employment
(
id
serial
PRIMARY
KEY
,
employee
varchar
(
20
)
,
period
daterange
)
;
CREATE
INDEX
ix_employment_period
ON
employment
USING
gist
(
period
)
;
INSERT
INTO
employment
(
employee
,
period
)
VALUES
(
'Alex'
,
'[2012-04-24, infinity)'
:
:
daterange
)
,
(
'Sonia'
,
'[2011-04-24, 2012-06-01)'
:
:
daterange
)
,
(
'Leo'
,
'[2012-06-20, 2013-04-20)'
:
:
daterange
)
,
(
'Regina'
,
'[2012-06-20, 2013-04-20)'
:
:
daterange
)
;
Two range operators tend to be used most often: overlap
(&&
) and contains (@>
). Those are
the ones we’ll cover. To see the full catalog of range operators, go to
Range Operators.
As the name suggests, the overlap operator &&
returns true
if two ranges have any values in
common. Example 5-26 demonstrates this
operator and puts to use the string_agg
function
for aggregating the list of employees into a single text
field.
SELECT
e1
.
employee
,
string_agg
(
DISTINCT
e2
.
employee
,
', '
ORDER
BY
e2
.
employee
)
As
colleagues
FROM
employment
As
e1
INNER
JOIN
employment
As
e2
ON
e1
.
period
&&
e2
.
period
WHERE
e1
.
employee
<>
e2
.
employee
GROUP
BY
e1
.
employee
;
employee | colleagues ---------+------------------- Alex | Leo, Regina, Sonia Leo | Alex, Regina Regina | Alex, Leo Sonia | Alex
In the contains operator (@>
), the first argument is a range and the second is a value. If
the second is within the first, the contains operator returns true.
Example 5-27 demonstrates its
use.
SELECT
employee
FROM
employment
WHERE
period
@>
CURRENT_DATE
GROUP
BY
employee
;
employee -------- Alex
The reverse of the contains operator is the contained operator (<@
), whose first
argument is the value and the second the range.
PostgreSQL provides JSON (JavaScript Object Notation) and many support functions. JSON has become the most popular data interchange format for web applications. Version 9.3 significantly beefed up JSON support with new functions for extracting, editing, and casting to other data types. Version 9.4 introduced the JSONB data type, a binary form of JSON that can also take advantage of indexes. Version 9.5 introduced more functions for jsonb, including functions for setting elements in a jsonb object. Version 9.6 introduced the jsonb_insert function for inserting elements into an existing jsonb array or adding a new key value.
To create a table to store JSON, define a column as a json
type:
CREATE
TABLE
persons
(
id
serial
PRIMARY
KEY
,
person
json
);
Example 5-28 inserts JSON data. PostgreSQL automatically validates the input to make sure what you are adding is valid JSON. Remember that you can’t store invalid JSON in a JSON column, nor can you cast invalid JSON to a JSON data type.
INSERT
INTO
persons
(
person
)
VALUES
(
'{
"name":"Sonia",
"spouse":
{
"name":"Alex",
"parents":
{
"father":"Rafael",
"mother":"Ofelia"
},
"phones":
[
{
"type":"work",
"number":"619-722-6719"
},
{
"type":"cell",
"number":"619-852-5083"
}
]
},
"children":
[
{
"name":"Brandon",
"gender":"M"
},
{
"name":"Azaleah",
"girl":true,
"phones": []
}
]
}'
);
The easiest way to traverse the hierarchy of a JSON object is by using pointer symbols. Example 5-29 shows some common usage.
SELECT
person
->
'name'
FROM
persons
;
SELECT
person
->
'spouse'
->
'parents'
->
'father'
FROM
persons
;
You can also write the query using a path array as in the following example:
SELECT
person
#>
array
[
'spouse'
,
'parents'
,
'father'
]
FROM
persons
;
Notice that you must use the #> pointer symbol if what comes after is a path array.
To penetrate JSON arrays, specify the array index. JSON arrays is zero-indexed, unlike PostgreSQL arrays, whose indexes start at 1.
SELECT
person
->
'children'
->
0
->
'name'
FROM
persons
;
And the path array equivalent:
SELECT
person
#>
array
[
'children'
,
'0'
,
'name'
]
FROM
persons
;
All queries in the prior examples return the value as JSON primitives (numbers, strings, booleans). To return the text representation, add another greater-than sign as in the following examples:
SELECT
person
->
'spouse'
->
'parents'
->>
'father'
FROM
persons
;
SELECT
person
#>>
array
[
'children'
,
'0'
,
'name'
]
FROM
persons
;
If you are chaining the -> operator, only the very last one can be a ->> operator.
The json_array_elements function takes a JSON array and returns each element of the array as a separate row as in Example 5-30.
SELECT
json_array_elements
(
person
->
'children'
)
->>
'name'
As
name
FROM
persons
;
name ------- Brandon Azaleah (2 rows)
We strongly encourage you to use pointer symbols when drilling down into a JSON object. The syntax is more succinct and you can use the same operators as for JSONB (which we’ll cover shortly). PostgreSQL does offer functional equivalents if you need them: json_extract_path is a variadic function (functions with an unlimited number of arguments). The first argument is always the JSON object you are trying to navigate; subsequent parameters are the key value for each tier of the hierarchy. The equivalent to ->> and #>> is json_extract_path_text.
In addition to querying JSON data, you can convert other data to JSON. In these next examples, we’ll demonstrate the use of JSON built-in functions to create JSON objects.
Example 5-31 demonstrates the use
of row_to_json
to convert a subset of columns
in each record from the table we created and loaded in Example 5-28.
SELECT
row_to_json
(
f
)
As
x
FROM
(
SELECT
id
,
json_array_elements
(
person
->
'children'
)
->>
'name'
As
cname
FROM
persons
)
As
f
;
x -------------------------- {"id":1,"cname":"Brandon"} {"id":1,"cname":"Azaleah"} (2 rows)
To output each row in our persons
table as JSON:
SELECT
row_to_json
(
f
)
As
jsoned_row
FROM
persons
As
f
;
The use of a row as an output field in a query is a feature unique
to PostgreSQL. It’s handy for creating complex JSON objects. We describe
it further in “Composite Types in Queries”, and Example 7-20 demonstrates the use of array_agg
and array_to_json
to output a set of rows as a single JSON object. In version
9.3 we have at our disposal the json_agg
function. We demonstrate its use
in Example 7-21.
New in PostgreSQL 9.4 is the jsonb
data type. It is handled through
the same operators as those for the json
type, and
similarly named functions, plus several additional ones.
jsonb
performance is much better than json
performance because jsonb
doesn’t
need to be reparsed during operations. There are a couple of key
differences between the jsonb
and json
data
types:
jsonb
is internally stored as a binary object and
does not maintain the formatting of the original JSON text as the
json
data type does. Spaces aren’t preserved, numbers
can appear slightly different, and attributes become sorted. For
example, a number input as e-5
would be converted to
its decimal representation.
jsonb
does not allow duplicate keys and silently
picks one, whereas the json
type preserves duplicates.
This is demonstrated in Michael Paquier’s article “Manipulating jsonb data by abusing of
key uniqueness”.
jsonb
columns can be directly indexed using the
GIN index method (covered in “Indexes”),
whereas json
requires a functional index to extract key
elements.
To demonstrate these concepts, we’ll create another persons
table, replacing the json
column with a
jsonb
:
CREATE
TABLE
persons_b
(
id
serial
PRIMARY
KEY
,
person
jsonb
);
To insert data into our new table, we would repeat Example 5-28.
So far, working with JSON and binary JSON has been the same. Differences appear when you query. To make the binary JSON readable, PostgreSQL converts it to a canonical text representation, as shown in Example 5-32.
SELECT
person
As
b
FROM
persons_b
WHERE
id
=
1
;
SELECT
person
As
j
FROM
persons
WHERE
id
=
1
;
b --------------------------------------------------------------------------------- {"name": "Sonia", "spouse": {"name": "Alex", "phones": [{"type": "work", "number": "619-722-6719"}, {"type": "cell", "number": "619-852-5083"}], "parents": {"father": "Rafael", "mother": "Ofelia"}}, "children": [{"name": "Brandon", "gender": "M"}, {"girl": true, "name": "Azaleah", "phones": []}]} (1 row)
j --------------------------------------------- { "name":"Sonia", "spouse": { "name":"Alex", "parents": { "father":"Rafael", "mother":"Ofelia" }, "phones": [ { "type":"work", "number":"619-722-6719"+ }, { "type":"cell", "number":"619-852-5083"+ } ] }, "children": [ { "name":"Brandon", "gender":"M" }, { "name":"Azaleah", "girl":true, "phones": [] } ] } (1 row)
jsonb
reformats input and removes whitespace.
Also, the order of attributes is not maintained from the
insert.
json
maintains input whitespace and the order of
attributes.
jsonb
has similarly named functions as
json
, plus some additional ones. So, for example, the
json
family of functions such as json_extract_path_text
and
json_each
are matched in jsonb
by
jsonb_extract_path_text
, jsonb_each
,
etc. However, the equivalent operators are the same, so
you will find that the examples in “Querying JSON”
work largely the same without change for the jsonb
type—just replace the table name and json_array_elements
with
jsonb_array_elements
.
In addition to the operators supported by json
, jsonb
has
additional comparator operators for equality (=
), contains (@>
), contained
(<@
), key exists (?
), any of array of keys exists (?|
), and all of array of keys exists
(?&
).
So, for example, to list all people that have a child named Brandon, use the contains operator as demonstrated in Example 5-33.
SELECT
person
->>
'name'
As
name
FROM
persons_b
WHERE
person
@>
'{"children":[{"name":"Brandon"}]}'
;
name ----- Sonia
These additional operators provide very fast checks when you
complement them with a GIN index on the jsonb
column:
CREATE
INDEX
ix_persons_jb_person_gin
ON
persons_b
USING
gin
(
person
);
We don’t have enough records in our puny table for the index to kick in, but for more rows, you’d see that Example 5-33 utilizes the index.
PostgreSQL 9.5 introduced native jsonb concatenation (||
) and
subtraction operators (-
, #-
) as well as
companion functions for setting data. These operators do not exist for
the json datatype. To be able to accomplish these tasks in prior
versions, you’d have to lean on “Writing PL/V8, PL/CoffeeScript, and PL/LiveScript Functions” to do
the work.
The concatenation operator can be used to add and replace attributes of a
jsonb object. In Example 5-34 we add an
address attribute to the Gomez family and use the RETURNING
construct covered in “Returning Affected Records to the User” to
return the updated value. The new value has an address attribute.
UPDATE
persons_b
SET
person
=
person
||
'{"address": "Somewhere in San Diego, CA"}'
::
jsonb
WHERE
person
@>
'{"name":"Sonia"}'
RETURNING
person
;
profile ------------------------------------------------------------------------------- {"name": "Sonia", ... "address": "Somewhere in San Diego, CA", "children": ...} (1 row) UPDATE 1
Because JSONB requires that keys be unique, if you try to add a duplicate key, the original value will be replaced instead. So to update with a new address, we would repeat the exercise in Example 5-34, but replacing Somewhere in San Diego, CA with something else.
If we decided we no longer wanted an address, we could use the
-
as shown in Example 5-35.
UPDATE
persons_b
SET
person
=
person
-
'address'
WHERE
person
@>
'{"name":"Sonia"}'
;
The simple -
operator works for first-level elements, but what if you wanted to remove
an attribute from a particular member? This is when you’d use the #-
operator. #-
takes an
array of text values that denotes the path of the element you want to
remove. In Example 5-36 we
remove the girl
designator of Azaleah.
UPDATE
persons_b
SET
person
=
person
#-
'{children,1,girl}'
::
text
[]
WHERE
person
@>
'{"name":"Sonia"}'
RETURNING
person
->
'children'
->
1
;
{"name": "Azaleah", "phones": []}
When removing elements from an array, you need to denote the
index. Because JavaScript indexes start at 0, to remove an element from
the second child, we use 1
instead of 2. If we wanted to
remove Azaleah entirely, we would have used
'{children,1}'::text[]
.
To add a gender attribute, or replace one that was previously set,
we can use the jsonb_set
function as shown in Example 5-37.
UPDATE
persons_b
SET
person
=
jsonb_set
(
person
,
'{children,1,gender}'
::
text
[],
'"F"'
::
jsonb
,
true
)
WHERE
person
@>
'{"name":"Sonia"}'
;
jsonb_set
takes three arguments of form
jsonb_set(
.
If you set jsonb_to_update
,
text_array_path
,
new_jsonb_value
,allow_creation
)allow_creation
to
false
when the property did not already exist, the
statement will return an error.
The XML data type, similar to JSON, is “controversial” in a relational database
because it violates the principles of normalization. Nonetheless, all of
the high-end relational database products (IBM DB2, Oracle, SQL Server)
support XML. PostgreSQL also jumped on the bandwagon and offers plenty of
functions to boot. (We’ve authored many articles on working with XML in
PostgreSQL.) PostgreSQL comes packaged with functions for
generating, manipulating, and parsing XML data. These are outlined in
XML Functions. Unlike the
jsonb
type, there is currently no direct index support for
it. So you need to use functional indexes to index subparts, similar to
what you can do with the plain json
type.
When you create a column of the xml
data type, PostgreSQL
automatically ensures that only valid XML values populate the rows. This
is what distinguishes an XML column from just any text column. However,
the XML is not validated against any Document Type Definition (DTD) or XML Schema Definition
(XSD), even if it is specified in the XML document. To freshen up
on what constitutes valid XML, Example 5-38
shows you how to append XML data to a table by declaring a column as
xml
and inserting into it as usual.
CREATE
TABLE
families
(
id
serial
PRIMARY
KEY
,
profile
xml
);
INSERT
INTO
families
(
profile
)
VALUES
(
'<family name="Gomez">
<member><relation>padre</relation><name>Alex</name></member>
<member><relation>madre</relation><name>Sonia</name></member>
<member><relation>hijo</relation><name>Brandon</name></member>
<member><relation>hija</relation><name>Azaleah</name></member>
</family>'
);
Each XML value could have a different XML structure. To enforce
uniformity, you can add a check constraint, covered in “Check Constraints”, to the XML column. Example 5-39 ensures that all
family
has at least one relation
element. The
'/family/member/relation'
is XPath syntax, a basic way to
refer to elements and other parts of XML.
ALTER
TABLE
families
ADD
CONSTRAINT
chk_has_relation
CHECK
(
xpath_exists
(
'/family/member/relation'
,
profile
));
If we then try to insert something like:
INSERT
INTO
families
(
profile
)
VALUES
(
'<family name="HsuObe"></family>'
);
we will get this error: ERROR: new row for relation
"families" violates check constraint "chk_has_relation"
.
For more involved checks that require checking against DTD or XSD, you’ll need to resort to writing functions and using those in the check constraint, because PostgreSQL doesn’t have built-in functions to handle those kinds of checks.
To query XML, the xpath
function is really useful. The first argument is an XPath query, and
the second is an xml
object. The output is an array of XML
elements that satisfies the XPath query. Example 5-40 combines xpath
with unnest
to return all
the family members. unnest
unravels the array into a row set. We then cast the XML fragment to
text.
SELECT
ordinality
AS
id
,
family
,
(
xpath
(
'/member/relation/text()'
,
f
)
)
[
1
]
:
:
text
As
relation
,
(
xpath
(
'/member/name/text()'
,
f
)
)
[
1
]
:
:
text
As
mem_name
FROM
(
SELECT
(
xpath
(
'/family/@name'
,
profile
)
)
[
1
]
:
:
text
As
family
,
f
.
ordinality
,
f
.
f
FROM
families
,
unnest
(
xpath
(
'/family/member'
,
profile
)
)
WITH
ORDINALITY
AS
f
)
x
;
id | family | relation | mem_name ----+--------+----------+---------- 1 | Gomez | padre | Alex 2 | Gomez | madre | Sonia 3 | Gomez | hijo | Brandon 4 | Gomez | hija | Azaleah (4 rows)
Get the text element in the relation
and name
tags of each member
element. We need to use array
subscripting because xpath
always returns an array,
even if only one element is returned.
Get the name attribute from family
root. For this we use
@
.attribute_name
Break the result of the SELECT into the subelements <member>
, <relation>
, </relation>
, <name>
, </name>
, and </member>
tags. The slash is a way
of getting at subtag elements. For example,
xpath('/family/member', 'profile')
will return an array
of all members in each family that is defined in a profile. The @
sign is used to select attributes of an element. So, for
example, family/@name
returns the
name attribute of a family
. By
default, xpath
always returns an
element, including the tag part. The text()
forces a
return of just the text body of an element.
New in version 10 is the ANSI-SQL standard XMLTABLE construct. XMLTABLE converts text of XML into individual rows and columns based on some defined transformation. We’ll repeat Example 5-40 using XMLTABLE.
SELECT
xt
.
*
FROM
families
,
XMLTABLE
(
'/family/member'
PASSING
profile
COLUMNS
id
FOR
ORDINALITY
,
family
text
PATH
'../@name'
,
relation
text
NOT
NULL
,
member_name
text
PATH
'name'
NOT
NULL
)
AS
xt
;
id | family | relation | mem_name ----+--------+----------+---------- 1 | Gomez | padre | Alex 2 | Gomez | madre | Sonia 3 | Gomez | hijo | Brandon 4 | Gomez | hija | Azaleah (4 rows)
The first part is an XML path element that defines the row.
The word PASSING is followed by the table column to parse out
rows. This column has to be of type xml
. We use the
families.profile
column of our families table.
The COLUMNS component should define the list of columns to be parsed out of the xml.
Similar to WITH ORDINALITY in conjunction with set-returning functions, you can use FOR ORDINALITY to assign numeric order to each record.
You can use ../ to move up a level above the base of the
row. In this case we use ./@name to get the family name, which is
one level above family/member. The @ is used to denote this is an
attribute (something of form name='a value'
) and not
an element.
If a path element matches the name of your defined column, you
don’t need to specify the PATH. In this case, because /family/member/relation matches our
column name relation
, we can
skip the PATH clause.
I’m sure you’ve seen websites where you can search by typing in keywords. An ecommerce site will bring up a list of matching products; a film site will bring up a list of matching movies; a knowledgebase site will bring up matching questions and answers, etc.
To search textual data by keywords, you have at your disposal
the like
or ilike
(case
insensitive) commands. You can also avail yourself of powerful regular
expression and Soundex searches. But both of these methods stop short of
offering natural language−based match conditions. For example, if you’re
looking for LGBT movies and type that abbreviation into your search,
you’re going to miss movies described as lesbian, gay, bisexual, or
transgendered. If you type in the search term lots of steamy sex
scenes, you may end up with nothing unless the description very
closely matches what you typed in.
FTS is a suite of tools that adds a modicum of “intelligence” to your searches. Though it’s far from being able to read your mind, it can find words that are close in meaning, rather than spelling. FTS is packaged into PostgreSQL, with no additional installation necessary.
At the core of FTS is an FTS configuration. The configuration codifies the rules under which match will occur by referring to one or more dictionaries. For instance, if your dictionary contains entries that equate the words love, romance, infatuation, lust, then any search by one of the words will find matches with any of the words. Dictionaries may also equate words with the same stem. For example, love, loving, and loved share a common stem. A dictionary could equate all principle parts of a verb; for example, eat, eats, ate, and eaten could be considered the same.
A dictionary can also list stop words. These are usually parts of speech that add little to the meaning. Articles, conjunctions, prepositions, and pronouns such as a, the, on, and that often make up the list of stop words.
Beyond matching synonyms and pruning stop words, FTS can be used to rank searches. FTS can utilize the proximity of words to each other and the frequency of terms in text to rank search results. For example, if you’re interested in viewing movies where sex is depicted with smoking, you could search for the two words sex and smoking, but also specify that the two words must be two words apart and rank higher if they appear in the title. And so they smoked after sex would hit, whereas sex took place in a hotel, which has a foyer for smoking guests would miss. FTS can apply unequal weights to the places where the sought-after words appear in the text. For instance, if you have a movie where the word sex appears in either the title or the byline, you could make this movie rank higher than movies where sex is only in the description.
Most PostgreSQL distributions come packaged with over 10 FTS configurations. All these are installed in the pg_catalog schema.
To see the listing of installed FTS configurations, run the query
SELECT cfgname FROM pg_ts_config;
. Or use the
dF
command in psql. A typical list follows:
cfgname ---------- simple danish dutch english finnish french german hungarian italian norwegian portuguese romanian russian spanish swedish turkish (16 rows)
If you need to create your own configurations or dictionaries, refer to PostgreSQL Manual: Full Text Search Configuration and PostgreSQL Manual: Full Text Search Dictionaries.
You’re not limited to built-in FTS configurations. You can create your own. But before you do, you may wish to see what other users have already created that may suit your needs. If your text is medical-related, you may be able to find a configuration with dictionaries chock full of specialized anatomy terms. If your text is in Spanish, find a configuration that tailors to your particular dialect of Spanish.
Once you locate a configuration that you’d like added to your arsenal, installation is quite simple and usually doesn’t require additional compilation. We demonstrate by installing the popular hunspell configuration.
Start by downloading hunspell configurations from hunspell_dicts. You’ll be greeted by hunspell for many different languages. We’ll go with hunspell_en_us:
Download everything in the folder.
Copy en_us.affix and en_us.dict to your PostgreSQL installation directory share/tsearch_data.
Copy the hunspell_en_us--*.sql and hunspell_en_us.control files to your PostgreSQL installation directory share/extension folder.
CREATE EXTENSION hunspell_en_us SCHEMA pg_catalog;
From psql, if you now run Example 5-42, you’ll see details of the hunspell configuration and dictionary we just installed.
dF+ english_hunspell;
Text search configuration "pg_catalog.english_hunspell" Parser: "pg_catalog.default" Token | Dictionaries ----------------+------------------------------- asciihword | english_hunspell,english_stem asciiword | english_hunspell,english_stem email | simple file | simple float | simple host | simple hword | english_hunspell,english_stem hword_asciipart | english_hunspell,english_stem hword_numpart | simple hword_part | english_hunspell,english_stem int | simple numhword | simple numword | simple sfloat | simple uint | simple url | simple url_path | simple version | simple word | english_hunspell,english_stem
Keep in mind that not all FTS configurations install in the same way. Read the instructions.
Contrast that output to the built-in English configuration in Example 5-43, which gives you the dictionaries used by the English configuration.
dF+ english;
Text search configuration "pg_catalog.english" Parser: "pg_catalog.default" Token | Dictionaries ----------------+-------------- asciihword | english_stem asciiword | english_stem email | simple file | simple float | simple host | simple hword | english_stem hword_asciipart | english_stem hword_numpart | simple hword_part | english_stem int | simple numhword | simple numword | simple sfloat | simple uint | simple url | simple url_path | simple version | simple word | english_stem
The only difference between the two is that hunspell draws from an additional dictionary.
Not sure which configuration is the default? Run:
SHOW default_text_search_config;
To replace the default with another, run:
ALTER DATABASE postgresql_book SET default_text_search_config = 'pg_catalog.english';
This replacement takes place at the database level, but as with most PostgreSQL configuration settings, you can make the change at the server, user, or session levels.
A text column must be vectorized before FTS can search against it. The resultant vector column is a tsvector data type. To create a tsvector from text, you must specify the FTS configuration to use. The vectorization reduces the original text to a set of word skeletons, referred to as lexemes, by removing stop words. For each lexeme, the TSVector records where in the original text it appears. The more frequently a lexeme appears, the higher the weight. Each lexeme therefore is imbued with at least one position, much like a vector in the physical sense.
Use the to_tsvector function to vectorize a blob of text. This function will resort to the default FTS configuration unless you specify another.
Example 5-44 shows how TSVectors differ depending on which FTS configuration was used in their construction.
SELECT
c
.
name
,
CASE
WHEN
c
.
name
=
'default'
THEN
to_tsvector
(
f
.
t
)
ELSE
to_tsvector
(
c
.
name
::
regconfig
,
f
.
t
)
END
As
vect
FROM
(
SELECT
'Just dancing in the rain. I like to dance.'
::
text
)
As
f
(
t
),
(
VALUES
(
'default'
),(
'english'
),(
'english_hunspell'
),(
'simple'
)
)
As
c
(
name
);
name | vect -----------------+----------------------------------------------------------------------------- default | 'danc':2,9 'like':7 'rain':5 english | 'danc':2,9 'like':7 'rain':5 english_hunspell | 'dance':2,9 'dancing':2 'like':7 'rain':5 simple | 'dance':9 'dancing':2 'i':6 'in':3 'just':1 'like':7 'rain':5 'the':4 'to':8 (4 rows)
Example 5-44 demonstrates how four different FTS configurations result in different vectors. Note how the English and Hunspell configurations remove all stop words, such as just and to. English and Hunspell also convert words to their normalized form as dictated by their dictionaries, so dancing becomes danc and dance, respectively. The simple configuration has no concept of stemming and stop words.
The to_tsvector function returns where each lexeme appears in the
text. So, for example, 'danc':2,9
means that
dancing and dance appear as
the second and the ninth words.
To incorporate FTS into your database, add a tsvector column to your table. You then either schedule the tsvector column to be updated regularly, or add a trigger to the table so that whenever relevant fields update, the tsvector field recomputes.
For our examples, we gathered fictitious movie data. Load the tables from psql using the file.sql script as follows:
encoding utf8; i film.sql
Next, we add and compute a tsvector column to the film table as shown in Example 5-45.
ALTER
TABLE
film
ADD
COLUMN
fts
tsvector
;
UPDATE
film
SET
fts
=
setweight
(
to_tsvector
(
COALESCE
(
title
,
''
)),
'A'
)
||
setweight
(
to_tsvector
(
COALESCE
(
description
,
''
)),
'B'
);
CREATE
INDEX
ix_film_fts_gin
ON
film
USING
gin
(
fts
);
Example 5-45 vectorizes the title and description columns and stores the vector in a newly created tsvector column. To speed up searches, we add a GIN index on the tsvector column. GIN is a lossless index. You can also add a GiST index on a vector column. GiST is lossy and slower to search but builds quicker and takes up less disk space. We explore indexes in more detail in “Indexes”.
By populating the fts column, we’ve
introduced two new constructs, the setweight
function and the concatenation operator (||
), to
tsvector.
To distinguish the relative importance of different lexemes, you could assign a weight to each. The weights must be A, B, C, or D, with A ranking highest in importance. In Example 5-45, we assigned A to lexemes culled from the title and B to lexemes from the description. If our search term matches a lexeme from the title, we deem the match to be more relevant than a match from the description of the movie.
TSVectors can be formed from other tsvectors using the
concatenation (||
) operator. We used it here to combine the
title and description into a single tsvector. This way when we search,
we have to contend with only a single column.
Should data change in one of the basis columns forming the tsvector, you must re-vectorize. To avoid having to manually run to_tsvector every time data changes, create a trigger that responds to updates. In the trigger, use the handy tsvector_update_trigger function as shown in Example 5-46.
CREATE
TRIGGER
trig_tsv_film_iu
BEFORE
INSERT
OR
UPDATE
OF
title
,
description
ON
film
FOR
EACH
ROW
EXECUTE
PROCEDURE
tsvector_update_trigger
(
fts
,
'pg_catalog.english'
,
title
,
description
);
Example 5-46 reacts to an insert or update in the title or description by revectoring the fts column. One shortcoming though: tsvector_update_trigger does not support weighting.
A FTS, or any text search for that matter, has two components: the searched text and the search terms. For FTS to work, both must be vectorized. We have already seen how to vectorize the searched text to create tsvector columns. We now show you how to vectorize the search terms.
FTS refers to vectorized search terms as tsqueries, and PostgreSQL offers several functions that will convert plain-text search terms to tsqueries: to_tsquery, plainto_tsquery, and phraseto_tsquery. The latter is a new function in 9.6 and takes the ordering of words in the search term into consideration.
tsqueries are normally created on the fly rather than being stored in a table. However, if you are building a system where people can save their queries and run them, you could define a tsquery column in a table.
Example 5-47 shows the output using the to_tsquery functions against two configurations: the default English configuration and the Hunspell configuration.
SELECT
to_tsquery
(
'business & analytics'
);
to_tsquery ----------------- 'busi' & 'analyt'
SELECT
to_tsquery
(
'english_hunspell'
,
'business & analytics'
);
to_tsquery -------------------------------- ('business' | 'busy') & 'analyt'
Both examples are akin to searching for text containing the words
business and analytics. The
and operator (&
) means that both words must appear in the searched text. The or
operator (|
) means one or both of the words must appear in the searched text.
If the configuration in use finds multiple stems for a word, they are
stitched together by the or operator.
You should use the same FTS configuration as the one you used to build the tsvector.
A slight variant of to_tsquery is plain_totsquery. This function automatically inserts the and operator between words for you, saving you a few key clicks. See Example 5-48.
SELECT
plainto_tsquery
(
'business analytics'
);
plainto_tsquery ----------------- 'busi' & 'analyt'
to_tsquery and plainto_tsquery look only at words, not their sequence. So business analytics and analytics business produce the same tsquery. This is a shortcoming because you’re limited to searching by single words only. Version 9.6 addressed this with the function phraseto_tsquery. In Example 5-49, the phraseto_tsquery vectorizes the words, inserting the distance operator between the words. This means that the searched text must contain the words business and analytics in that order, upgrading a word search to a phrase search.
SELECT
phraseto_tsquery
(
'business analytics'
);
phraseto_tsquery ------------------- 'busi' <-> 'analyt'
SELECT
phraseto_tsquery
(
'english_hunspell'
,
'business analytics'
);
phraseto_tsquery --------------------------------------------- 'business' <-> 'analyt' | 'busy' <-> 'analyt'
You can also cast text to tsquery without using any functions, as
in 'business & analytics'::tsquery
. However, with
casts, words are not replaced with lexemes and are taken
literally.
TSQueries can be combined using the or operator (||
) or
the and operator
(&&
). The expression tsquery1 ||
tsquery2
means matching text must satisfy either tsquery1
or tsquery2. The expression tsquery1
&& tsquery2
means matching text must satisfy both
tsquery1 and tsquery2.
Examples of each are shown in Example 5-50.
SELECT
plainto_tsquery
(
'business analyst'
)
||
phraseto_tsquery
(
'data scientist'
);
tsquery ------------------------------------------- 'busi' & 'analyst' | 'data' <-> 'scientist'
SELECT
plainto_tsquery
(
'business analyst'
)
&&
phraseto_tsquery
(
'data scientist'
);
tsquery -------------------------------------------- 'busi' & 'analyst' & ('data' <-> 'scientist')
tsqueries and tsvectors have additional operators for doing things like determining if one is a subset of another, and several other functions. All this is detailed in PostgreSQL Manual: Text Search Functions and Operators.
We have created a tsvector from our text; we have created a tsquery from our search
terms. Now, we can perform an FTS. We do so by using the @@
operator. Example 5-51 demonstrates
it.
SELECT
left
(
title
,
50
)
As
title
,
left
(
description
,
50
)
as
description
FROM
film
WHERE
fts
@@
to_tsquery
(
'hunter & (scientist | chef)'
)
AND
title
>
''
;
title | description -----------------------+--------------------------------------------------- ALASKA PHANTOM | A Fanciful Saga of a Hunter And a Pastry Chef who CAUSE DATE | A Taut Tale of a Explorer And a Pastry Chef who mu CINCINATTI WHISPERER | A Brilliant Saga of a Pastry Chef And a Hunter who COMMANDMENTS EXPRESS | A Fanciful Saga of a Student And a Mad Scientist w DAUGHTER MADIGAN | A Beautiful Tale of a Hunter And a Mad Scientist w GOLDFINGER SENSIBILITY | A Insightful Drama of a Mad Scientist And a Hunter HATE HANDICAP | A Intrepid Reflection of a Mad Scientist And a Pio INSIDER ARIZONA | A Astounding Saga of a Mad Scientist And a Hunter WORDS HUNTER | A Action-Packed Reflection of a Composer And a Mad (9 rows)
Example 5-51 finds all films with a title or description containing the word hunter and either the word scientist, or the word chef, or both.
If you are running PostgreSQL 9.6, you can specify the proximity and order of words. See Example 5-52.
SELECT
left
(
title
,
50
)
As
title
,
left
(
description
,
50
)
as
description
FROM
film
WHERE
fts
@@
to_tsquery
(
'hunter <4> (scientist | chef)'
)
AND
title
>
''
;
title | description -----------------+--------------------------------------------------- ALASKA PHANTOM | A Fanciful Saga of a Hunter And a Pastry Chef who DAUGHTER MADIGAN | A Beautiful Tale of a Hunter And a Mad Scientist w (2 rows)
Example 5-52 requires that the word hunter precede scientist or chef by exactly four words.
FTS includes functions for ranking results. These functions are ts_rank and ts_rank_cd. ts_rank considers only the frequency of terms and weights, while ts_rank_cd (cd stands for coverage density) also considers the position of the search term within the searched text. If lexemes are found closer together, the result ranks higher. ts_rank_cd is meaningful only if you have position markers in your tsvector; otherwise, it returns zero. The frequency with which a search term appears also depends on position markers. So the ts_rank function will consider only weights if positional markers are missing. By default, ts_rank and ts_rank_cd apply the weights 0.1, 0.2, 0.4, and 1.0, respectively, for D, C, B, and A. Example 5-53 follows the default order.
SELECT
title
,
left
(
description
,
50
)
As
description
,
ts_rank
(
fts
,
ts
)::
numeric
(
10
,
3
)
AS
r
FROM
film
,
to_tsquery
(
'english'
,
'love & (wait | indian | mad)'
)
AS
ts
WHERE
fts
@@
ts
AND
title
>
''
ORDER
BY
r
DESC
;
title | description | r --------------+----------------------------------------------------+------ INDIAN LOVE | A Insightful Saga of a Mad Scientist And a Mad Sci | 0.999 LAWRENCE LOVE | A Fanciful Yarn of a Database Administrator And a | 0.252 (2 rows)
Let’s suppose we wish to retrieve a field only if the search terms appear in the title. For this situation we would assign 1 to the title field and 0 to all others. Example 5-54 repeats Example 5-53, passing in an array of weights.
SELECT
left
(
title
,
40
)
As
title
,
ts_rank
(
'{0,0,0,1}'
::
numeric
[],
fts
,
ts
)::
numeric
(
10
,
3
)
AS
r
,
ts_rank_cd
(
'{0,0,0,1}'
::
numeric
[],
fts
,
ts
)::
numeric
(
10
,
3
)
As
rcd
FROM
film
,
to_tsquery
(
'english'
,
'love & (wait | indian | mad )'
)
AS
ts
WHERE
fts
@@
ts
AND
title
>
''
ORDER
BY
r
DESC
;
title | r | rcd --------------+-------+------ INDIAN LOVE | 0.991 | 1.000 LAWRENCE LOVE | 0.000 | 0.000 (2 rows)
Notice how in Example 5-54 the second entry has a ranking of zero because the title does not contain all the words to satisfy the tsquery.
If performance is a concern, you should explicitly declare the
FTS configuration in queries instead of allowing the default behavior.
As noted in Some FTS Tricks by
Oleg Bartunov, you can achieve twice the speed by using
to_tsquery('english','social & (science |
scientist)')
in lieu of to_tsquery('social & (science
| scientist)')
.
By default, vectorization adds markers (location of the lexemes within the vector) and optionally weights (A, B, C, D). If your searches care only whether a particular term can be found, regardless of where it is in the text, how frequently it occurs, or its prominence, you can declutter your vectors using the strip function. This saves disk space and gains some speed. Example 5-55 compares what an unstripped versus stripped vector looks like.
SELECT
fts
FROM
film
WHERE
film_id
=
1
;
'academi':1A 'battl':15B 'canadian':20B 'dinosaur':2A 'drama':5B 'epic':4B 'feminist':8B 'mad':11B 'must':14B 'rocki':21B 'scientist':12B 'teacher':17B
SELECT
strip
(
fts
)
FROM
film
WHERE
film_id
=
1
;
'academi' 'battl' 'canadian' 'dinosaur' 'drama' 'epic' 'feminist' 'mad' 'must' 'rocki' 'scientist' 'teacher'
Keep in mind that although a stripped vector is faster to search and takes up less disk space, many operators and functions cannot be used in conjunction with them. For instance, because a stripped vector has no markers, distance operators cannot be used.
New in version 10 are ts_headline
and
to_tsvector
, which take as input json and jsonb data. The functions work just
like the text ones, except they consider only the values of json/jsonb
data and not the keys or json markup. Example 5-56 applies the function to the json
person column of the table we created in Example 5-28.
SELECT
to_tsvector
(
person
)
FROM
persons
WHERE
id
=
1
;
to_tsvector ---------------------------------------------------------------------------------- '-5083':19 '-6719':13 '-722':12 '-852':18 '619':11,17 'alex':3 'azaleah':25 'brandon':21 'cell':15 'm':23 'ofelia':7 'rafael':5 'sonia':1 'work':9 (1 row)
To apply this function to the jsonb table
persons_b
, swap out the persons
table for persons_b
. Similar to the to_tsvector for
text, these functions also have a variant that takes the FTS
configuration to use as their first argument. To make best use of these
functions, create a tsvector column in your table and populate the field
using either a trigger or update as needed.
Also available now for json and jsonb is the ts_headline
function, which tags as
HTML all matching text in the json document. Example 5-57 flags all references to Rafael in
the document.
SELECT
ts_headline
(
person
->
'spouse'
->
'parents'
,
'rafael'
::
tsquery
)
FROM
persons_b
WHERE
id
=
1
;
{"father": "<b>Rafael</b>", "mother": "Ofelia"} (1 row)
This section demonstrates how to define and use a custom type. The
composite
(aka record
, row
) object
type is often used to build an object that is then cast to a custom
type, or as a return type for functions needing to return multiple
columns.
PostgreSQL automatically creates custom types for all tables. For all intents and purposes, you can use custom types just as you would any other built-in type. So we could conceivably create a table that has a column type that is another table’s custom type, and we can go even further and make an array of that type. We demonstrate this “turducken” in Example 5-58.
CREATE
TABLE
chickens
(
id
integer
PRIMARY
KEY
);
CREATE
TABLE
ducks
(
id
integer
PRIMARY
KEY
,
chickens
chickens
[]);
CREATE
TABLE
turkeys
(
id
integer
PRIMARY
KEY
,
ducks
ducks
[]);
INSERT
INTO
ducks
VALUES
(
1
,
ARRAY
[
ROW
(
1
)::
chickens
,
ROW
(
1
)::
chickens
]);
INSERT
INTO
turkeys
VALUES
(
1
,
array
(
SELECT
d
FROM
ducks
d
));
We create an instance of a chicken without adding it to the
chicken
table itself; hence we’re
able to repeat id
with impunity. We take our array of two
chickens, stuff them into one duck, and add it to the ducks
table. We take the duck we added and
stuff it into the turkeys
table.
Finally, let’s see what we have in our turkey:
SELECT
*
FROM
turkeys
;
output -------------------------- id | ducks ---+---------------------- 1 | {"(1,"{(1),(1)}")"}
We can also replace subelements of our turducken. This next example replaces our second chicken in our first turkey with a different chicken:
UPDATE
turkeys
SET
ducks
[
1
].
chickens
[
2
]
=
ROW
(
3
)::
chickens
WHERE
id
=
1
RETURNING
*
;
output -------------------------- id | ducks ---+---------------------- 1 | {"(1,"{(1),(3)}")"}
We used the RETURNING
clause as discussed in “Returning Affected Records to the User”
to output the changed record.
Any complex row or column, regardless of how complex, can be converted to a json or jsonb column like so:
SELECT
id
,
to_jsonb
(
ducks
)
AS
ducks_jsonb
FROM
turkeys
;
id | ducks_jsonb ---+------------------------------------------------ 1 | [{"id": 1, "chickens": [{"id": 1}, {"id": 3}]}] (1 row)
PostgreSQL internally keeps track of object dependencies. The
ducks.chickens
column is dependent on the
chickens
table. The turkeys.ducks
column is
dependent on the ducks
table. You won’t be able to drop the
chickens
table without specifying
CASCADE
or first dropping the ducks.chickens
column. If you do a CASCADE
, the
ducks.chickens
column will be gone, and without warning,
your turkeys will have no chickens in their ducks.
Although you can easily create composite types just by creating a table, at some point, you’ll probably want to build your own from scratch. For example, let’s build a complex number data type with the following statement:
CREATE
TYPE
complex_number
AS
(
r
double
precision
,
i
double
precision
);
We can then use this complex number as a column type:
CREATE
TABLE
circuits
(
circuit_id
serial
PRIMARY
KEY
,
ac_volt
complex_number
);
We can then query our table with statements such as:
SELECT
circuit_id
,
(
ac_volt
).
*
FROM
circuits
;
or an equivalent:
SELECT
circuit_id
,
(
ac_volt
).
r
,
(
ac_volt
).
i
FROM
circuits
;
NULL is a confusing concept in the ANSI SQL Standard, primarily because
NULL != NULL
. When working with NULLs, instead, you need to
use IS NULL, IS NOT NULL, or NOT (somevalue
IS NULL). With
noncomposite types, something IS NULL
is generally the
antithesis to something IS NOT NULL
. This is not the case
with composites, however.
PostgreSQL abides by the ANSI SQL standard specs when dealing with NULLs. The specs require that in order for a composite to be IS NULL, all elements of the composite must be NULL. Here is where confusion can enter. In order for a composite to be considered IS NOT NULL, every element in the composite must return true for IS NOT NULL.
After you build a custom type such as a complex number, naturally you’ll want to create
functions and operators for it. We’ll demonstrate building a
+
operator for the complex_number
we created.
For more details about building functions, see Chapter 8. As stated earlier, an operator is a
symbol alias for a function that takes one or two arguments. You can
find more details about what symbols and sets of symbols are allowed in CREATE
OPERATOR.
In addition to being an alias, an operator contains optimization information that can be used by the query optimizer to decide how indexes should be used, how best to navigate the data, and which operator expressions are equivalent. More details about these optimizations and how each can help the optimizer are in Operator Optimization.
The first step to creating an operator is to create a function, as shown in Example 5-59.
CREATE
OR
REPLACE
FUNCTION
add
(
complex_number
,
complex_number
)
RETURNS
complex_number
AS
$$
SELECT
((
COALESCE
((
$
1
).
r
,
0
)
+
COALESCE
((
$
2
).
r
,
0
)),
(
COALESCE
((
$
1
).
i
,
0
)
+
COALESCE
((
$
2
).
i
,
0
)))::
complex_number
;
$$
language
sql
;
The next step is to create a symbolic operator to wrap the function, as in Example 5-60.
CREATE
OPERATOR
+
(
PROCEDURE
=
add
,
LEFTARG
=
complex_number
,
RIGHTARG
=
complex_number
,
COMMUTATOR
=
+
);
We can then test our new +
operator:
SELECT
(
1
,
2
)::
complex_number
+
(
3
,
-
10
)::
complex_number
;
which outputs (4,-8)
.
Although we didn’t demonstrate it here, you can overload functions
and operators to take different types as inputs. For example, you can
create an add
function and companion +
operator that takes a complex_number
and an
integer
.
The ability to build custom types and operators pushes PostgreSQL to the boundary of a full-fledged development environment, bringing us ever closer to our utopia where everything is table-driven.