Chapter 5. Data Types

PostgreSQL supports the workhorse data types of any database: numerics, strings, dates, times, and booleans. But PostgreSQL sprints ahead by adding support for arrays, time zone−aware datetimes, time intervals, ranges, JSON, XML, and many more. If that’s not enough, you can invent custom types. In this chapter, we don’t intend to cover every data type. For that, there’s always the manual. We showcase data types that are unique to PostgreSQL and nuances in how PostgreSQL handles common data types.

No data type would be useful without a cast of supporting functions and operators. And PostgreSQL has plenty of them. We’ll cover the more popular ones in this chapter.

Tip

When we use the term function, we’re talking about something that’s of the form f(x). When we use the term operator, we’re talking about something that’s symbolic and either unary (having one argument) or binary (having two arguments) such as +, -, *, or /. When using operators, keep in mind that the same symbol can take on a different meaning when applied to different data types. For example, the plus sign means adding for numerics but unioning for ranges.

Numerics

You will find your everyday integers, decimals, and floating-point numbers in PostgreSQL. Of the numeric types, we want to discuss serial data types and a nifty function to quickly generate arithmetic series of integers.

Serials

Serial and its bigger sibling, bigserial, are auto-incrementing integers often used as primary keys of tables in which a natural key is not apparent. This data type goes by different names in different database products, with autonumber being the most common alternative moniker. When you create a table and specify a column as serial, PostgreSQL first creates an integer column and then creates a sequence object named table_name_column_name_seq located in the same schema as the table. It then sets the default of the new integer column to read its value from the sequence. If you drop the column, PostgreSQL also drops the companion sequence object.

In PostgreSQL, the sequence type is a database asset in its own right. You can inspect and edit sequences using SQL with the ALTER SEQUENCE command or using pgAdmin. You can set the current value, boundary values (both the upper and lower bounds), and even how many numbers to increment each time. Though decrementing is rare, you can do it by setting the increment value to a negative number. Because sequences are independent database assets, you can create them separately from a table using the CREATE SEQUENCE command, and you can use the same sequence across multiple tables. The cross-table sharing of the same sequence comes in handy when you’re assigning a universal key in your database.
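
For instance, here’s a minimal sketch (the sequence name is ours) of reconfiguring a sequence to count down:

CREATE SEQUENCE countdown START 100;
ALTER SEQUENCE countdown INCREMENT BY -5 MINVALUE 0 MAXVALUE 100;
SELECT nextval('countdown');
100
SELECT nextval('countdown');
95
ALTER SEQUENCE countdown RESTART WITH 50;
SELECT nextval('countdown');
50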

To use an extant sequence for subsequent tables, create a new column in the table as integer or bigint—not as serial—then set the default value of the column using the nextval(sequence_name) function as shown in Example 5-1.

Example 5-1. Using existing sequence for new tables
CREATE SEQUENCE s START 1;
CREATE TABLE stuff(id bigint DEFAULT nextval('s') PRIMARY KEY, name text);
Warning

If you rename a table that has a serial based on a sequence, PostgreSQL will not automatically rename the sequence object. To avoid confusion, you should rename the sequence object.
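
For example, assuming a table stuff with a serial column id (and hence a companion sequence stuff_id_seq), the cleanup might look like the following. The rename does not break the column default, because the default references the sequence internally rather than by name:

ALTER TABLE stuff RENAME TO products;
ALTER SEQUENCE stuff_id_seq RENAME TO products_id_seq;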

Generate Series Function

PostgreSQL has a nifty function called generate_series not found in other database products. The function comes in two forms: a numeric version that creates a sequence of integers incremented by some value, and a temporal version that creates a sequence of dates or timestamps incremented by some time interval. What makes generate_series so convenient is that it allows you to effectively mimic a for loop in SQL. Example 5-2 demonstrates the numeric version; Example 5-13 demonstrates the temporal version.

Example 5-2 uses integers with an optional step parameter.

Example 5-2. generate_series() with stepping of 13
SELECT x FROM generate_series(1,51,13) As x;
x
----
1
14
27
40

The default step is 1. As demonstrated in Example 5-2, you can pass in an optional step argument to specify the increment between successive elements. The series never exceeds the prescribed end value, so although our range ends at 51, our last number is 40, because adding another 13 to 40 would bust the upper bound.
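
The step may also be negative, in which case the series counts down:

SELECT x FROM generate_series(5, 1, -2) As x;
x
---
5
3
1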

Textuals

There are three primitive textual types in PostgreSQL: character (abbreviable as char), character varying (abbreviable as varchar), and text.

Use char only when the values stored are fixed length, such as postal codes, phone numbers, and Social Security numbers in the US. If your value is shorter than the length specified, PostgreSQL automatically adds spaces to the end. When compared with varchar or text, the right-padding takes up superfluous storage, but you get the assurance of an invariable length. There is absolutely no performance benefit to using char over varchar or text, and char will always take up more disk space. Use character varying to store strings of varying length. When defining varchar columns, you should specify the maximum length of the values. Text is the most generic of the textual data types; with text, you cannot specify a maximum length.

The max length modifier for varchar is optional. Without it, varchar behaves almost identically to text. Subtle differences do surface when connecting to PostgreSQL via drivers. For instance, the ODBC driver cannot sort text columns. Both varchar and text have a maximum storage of 1 GB for each value—that’s a lot! Behind the scenes, any value larger than what can fit in a record page gets pushed to TOAST.
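
The following sketch (the table name is ours) shows the practical difference the length modifier makes: assignment to a varchar(n) column rejects overlong values, whereas an explicit cast silently truncates:

CREATE TABLE codes (code varchar(5));
INSERT INTO codes VALUES ('abcdef');
ERROR:  value too long for type character varying(5)
SELECT 'abcdef'::varchar(5) As code;
code
-----
abcde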

Some folks advocate abandoning varchar and always using text. Rather than waste space arguing about it here, read the debate at In Defense of Varchar(X).

Often, for cross-system compatibility, you want to remove case sensitivity from your character types. To do this, you need to override comparison operators that take case into consideration. Overriding operators is easier for varchar than it is for text. We demonstrate an example in Using MS Access with PostgreSQL, where we show how to make varchar behave without case sensitivity and still be able to use an index.

String Functions

Common string manipulations are padding (lpad, rpad), trimming whitespace (rtrim, ltrim, trim, btrim), extracting substrings (substring), and concatenating (||). Example 5-3 demonstrates padding, and Example 5-4 demonstrates trimming.

Example 5-3. Using lpad and rpad
SELECT
    lpad('ab', 4, '0') As ab_lpad,
    rpad('ab', 4, '0') As ab_rpad,
    lpad('abcde', 4, '0') As ab_lpad_trunc; 1
ab_lpad | ab_rpad | ab_lpad_trunc
--------+---------+---------------
00ab    | ab00    | abcd
1

lpad truncates instead of padding if the string is too long.

By default, trim functions remove spaces, but you can pass in an optional argument indicating other characters to trim.

Example 5-4. Trimming spaces and characters
SELECT
    a As a_before, trim(a) As a_trim, rtrim(a) As a_rt,
    i As i_before, ltrim(i, '0') As i_lt_0,
    rtrim(i, '0') As i_rt_0, trim(i, '0') As i_t_0
FROM (
	SELECT repeat(' ', 4) || i || repeat(' ', 4) As a, '0' || i As i
	FROM generate_series(0, 200, 50) As i
) As x;
a_before | a_trim | a_rt | i_before | i_lt_0 | i_rt_0 | i_t_0
---------+--------+------+----------+--------+--------+------
0        | 0      | 0    | 00       |        |        |
50       | 50     | 50   | 050      | 50     | 05     | 5
100      | 100    | 100  | 0100     | 100    | 01     | 1
150      | 150    | 150  | 0150     | 150    | 015    | 15
200      | 200    | 200  | 0200     | 200    | 02     | 2

A helpful function for aggregating strings is the string_agg function, which we demonstrate in Examples 3-14 and 5-26.
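
If you can’t wait, here’s a quick taste (the sample values are ours):

SELECT string_agg(mascot, ', ' ORDER BY mascot) As mascots
FROM (VALUES ('elephant'),('turtle'),('slonik')) As f(mascot);
mascots
-------------------------
elephant, slonik, turtle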

Splitting Strings into Arrays, Tables, or Substrings

There are a couple of useful functions in PostgreSQL for tearing strings apart.

The split_part function is useful for extracting an element from a delimited string, as shown in Example 5-5. Here, we select the second item in a string of items delimited by periods.

Example 5-5. Getting the nth element of a delimited string
SELECT split_part('abc.123.z45','.',2) As x;
x
---
123

The string_to_array function is useful for creating an array of elements from a delimited string. By combining string_to_array with the unnest function, you can expand the returned array into a set of rows, as shown in Example 5-6.

Example 5-6. Converting a delimited string to an array to rows
SELECT unnest(string_to_array('abc.123.z45', '.')) As x;
x
---
abc
123
z45

Regular Expressions and Pattern Matching

PostgreSQL’s regular expression support is downright fantastic. You can return matches as tables or arrays and choreograph replaces and updates. Back-referencing and other fairly advanced search patterns are also supported. In this section, we’ll provide a small sampling. For more information, see Pattern Matching and String Functions.

Example 5-7 shows you how to format phone numbers stored simply as contiguous digits.

Example 5-7. Reformat a phone number using back-referencing
SELECT regexp_replace(
    '6197306254',
    '([0-9]{3})([0-9]{3})([0-9]{4})',
    E'(\\1) \\2-\\3'
) As x;
x
--------------
(619) 730-6254

The \1, \2, etc., refer to the parenthesized groups captured by our pattern expression. The parentheses in the replacement string are literal characters and need no escaping. The E'' construct tells PostgreSQL to interpret backslash escapes in the string literal, so each doubled backslash (\\) reaches regexp_replace as the single backslash that marks a backreference.

Suppose some field contains text with embedded phone numbers; Example 5-8 shows how to extract the phone numbers and turn them into rows all in one step.

Example 5-8. Return phone numbers in piece of text as separate rows
SELECT unnest(regexp_matches(
'Cell (619) 852-5083. Work (619)123-4567 , Casa 619-730-6254. Bésame mucho.',
E'[(]{0,1}[0-9]{3}[)-.]{0,1}[\\s]{0,1}[0-9]{3}[-.]{0,1}[0-9]{4}', 'g')
) As x;
x
--------------
(619) 852-5083
(619)123-4567
619-730-6254
(3 rows)

The matching rules for Example 5-8 are:

  • [(]{0,1}: starts with zero or one open parenthesis.

  • [0-9]{3}: followed by three digits.

  • [)-.]{0,1}: followed by zero or one closing parenthesis, hyphen, or period.

  • [\s]{0,1}: followed by zero or one whitespace character.

  • [0-9]{3}: followed by three more digits.

  • [-.]{0,1}: followed by zero or one hyphen or period.

  • [0-9]{4}: ends with four digits.

  • regexp_matches returns a string array consisting of matches of a regular expression. The last input to our function is the flags parameter. We set this to g, which stands for global and returns all matches of a regular expression as separate elements. If you leave out this flags parameter, then your array will only contain the first match. The flags parameter can consist of more than one flag. For example, if you have letters in your regular expression and text and you want to make the check case insensitive and global, you would use two flags, gi. In addition to the global flag, other allowed flags are listed in POSIX EMBEDDED OPTIONS.

  • unnest explodes an array into a row set.

Tip

There are many ways to compose the same regular expression. For instance, \d is shorthand for [0-9]. But given the few characters you’d save, we prefer the more descriptive longhand.

If you only care about the first match, you can utilize the substring function, which will return the first matching value as shown in Example 5-9.

Example 5-9. Return first phone number in piece of text
SELECT substring(
'Cell (619) 852-5083. Work (619)123-4567 , Casa 619-730-6254. Bésame mucho.'
from E'[(]{0,1}[0-9]{3}[)-.]{0,1}[\\s]{0,1}[0-9]{3}[-.]{0,1}[0-9]{4}')
 As x;
       x
----------------
 (619) 852-5083
(1 row)

In addition to the wealth of regular expression functions, you can use regular expressions with the SIMILAR TO predicate and the POSIX match operator (~). The following example returns all description fields with embedded phone numbers:

SELECT description
FROM mytable
WHERE description ~  
E'[(]{0,1}[0-9]{3}[)-.]{0,1}[\\s]{0,1}[0-9]{3}[-.]{0,1}[0-9]{4}';

Temporals

PostgreSQL support for temporal data is second to none. In addition to the usual dates and times types, PostgreSQL supports time zones, enabling the automatic handling of daylight saving time (DST) conversions by region. Specialized data types such as interval offer datetime arithmetic. PostgreSQL also understands infinity and negative infinity, relieving us from having to create conventions that we’ll surely forget. Range types provide support for temporal ranges with a slew of companion operators, functions, and indexes. We cover range types in “Range Types”.

At last count, PostgreSQL has nine temporal data types. Understanding their distinctions is important in ensuring that you choose the right data type for the job. All the types except range abide by ANSI SQL standards. Other leading database products support some, but not all, of these data types. Oracle has the most varieties of temporal types; SQL Server ranks second; and MySQL comes in last.

PostgreSQL temporal types vary in a number of ways to handle different situations. If a type is time zone−aware, the time changes if you change your server’s time zone. The types are:

date

Stores the month, day, and year, with no time zone awareness and no concept of hours, minutes, or seconds.

time (aka time without time zone)

Stores hours, minutes, and seconds with no awareness of time zone or calendar dates.

timestamp (aka timestamp without time zone)

Stores both calendar dates and time (hours, minutes, seconds) but does not care about the time zone.

timestamptz (aka timestamp with time zone)

A time zone−aware date and time data type. Internally, timestamptz is stored in Coordinated Universal Time (UTC), but its display defaults to the time zone of the server, the service config, the database, the user, or the session. Yes, you can observe different time zones at different levels. If you input a timestamp with no time zone and cast it to one with the time zone, PostgreSQL assumes the default time zone in effect. If you don’t set your time zone in postgresql.conf, the server’s default takes effect. This means that if you change your server’s time zone, you’ll see all the displayed times change after the PostgreSQL server restarts.

timetz (aka time with time zone)

The lesser-used sister of timestamptz. It is time zone−aware but does not store the date. It always assumes DST of the current date and time. Some programming languages with no concept of time without date might map timetz to a timestamp with some arbitrary date such as Unix Epoch 1970, resulting in year 1970 being assumed.

interval

A duration of time in hours, days, months, minutes, and others. It comes in handy for datetime arithmetic. For example, if the world is supposed to end in exactly 666 days from now, all you have to do is add an interval of 666 days to the current time to get the exact moment (and plan accordingly); see the sketch following this list.

tsrange

Allows you to define open and closed ranges of timestamp with no time zone. The type consists of two timestamps and open/closed range qualifiers. For example, '[2012-01-01 14:00, 2012-01-01 15:00)'::tsrange defines a period starting at 14:00 but ending before 15:00. Refer to Range Types for details.

tstzrange

Allows you to define open and closed ranges of timestamp with time zone.

daterange

Allows you to define open and closed ranges of dates.
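
Here is the interval sketch promised above: the doomsday computation is a single expression, with no special date-math function needed:

SELECT now() + interval '666 days' As doomsday;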

Time Zones: What They Are and Are Not

A common misconception with PostgreSQL time zone−aware data types is that PostgreSQL records an extra time marker with the datetime value itself. This is incorrect. If you save 2012-2-14 18:08:00-8 (-8 being the Pacific offset from UTC), PostgreSQL internally thinks like this:

  1. Calculate the UTC time for 2012-02-14 18:08:00-8. This is 2012-02-15 02:08:00-0.

  2. Store the value 2012-02-15 02:08:00.

When you call the data back for display, PostgreSQL internally works like this:

  1. Start with the requested time zone, defaulting to the server time zone if none is requested.

  2. Compute the offset from UTC for that time zone at that moment (-5 for America/New_York).

  3. Apply the offset to the stored UTC datetime (2012-02-15 02:08:00 with a -5 offset becomes 2012-02-14 21:08:00).

  4. Display the result (2012-02-14 21:08:00-5).

So PostgreSQL doesn’t store the time zone, but uses it only to convert the datetime to UTC before storage. After that, the time zone information is discarded. When PostgreSQL displays datetime, it does so in the default time zone dictated by the session, user, database, or server, in that order. If you use time zone−aware data types, you should consider the consequence of a server move from one time zone to another. Suppose you based a server in New York City and subsequently restored the database in Los Angeles. All timestamps with time zone fields could suddenly display in Pacific time. This is fine as long as you anticipate this behavior.
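
You can watch this display behavior without moving any servers by overriding the time zone at the session level. A quick sketch, reusing the stored UTC value from the steps above (offsets shown are the 2012 US winter offsets):

SET TIME ZONE 'America/New_York';
SELECT '2012-02-15 02:08:00+00'::timestamptz;
2012-02-14 21:08:00-05
SET TIME ZONE 'America/Los_Angeles';
SELECT '2012-02-15 02:08:00+00'::timestamptz;
2012-02-14 18:08:00-08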

Here’s an example of how something can go wrong. Suppose that McDonald’s had its server on the East Coast and the opening time for stores is stored as timetz. A new McDonald’s opens up in San Francisco. The new franchisee phones McDonald’s headquarters to add its store to the master directory with an opening time of 7 a.m. The data entry dude entered the information as he is told: 7 a.m. The East Coast PostgreSQL server interprets this to mean 7 a.m. Eastern, and now early risers in San Francisco are lining up at the door wondering why they can’t get their McBreakfasts at 4 a.m. Being hungry is one thing, but we can imagine many situations in which confusion over a difference of three hours could mean life or death.

Given the pitfalls, why would anyone want to use time zone−aware data types? First, it does spare you from having to do time zone conversions manually. For example, if a flight leaves Boston at 8 a.m. and arrives in Los Angeles at 11 a.m., and your server is in Europe, you don’t want to have to figure out the offset for each time manually. You could just enter the data with the Boston and Los Angeles local times. There’s another convincing reason to use time zone−aware data types: the automatic handling of DST. With countries deviating more and more from one another in DST schedules, manually keeping track of DST changes for a globally used database would require a dedicated programmer who does nothing but keep up-to-date with the latest DST schedules and map them to geographic enclaves.

Here’s an interesting example: a traveling salesperson catches a flight home from San Francisco to nearby Oakland. When she boards the plane, the clock at the terminal reads 2012-03-11 1:50 a.m. When she lands, the clock in the terminal reads 2012-03-11 3:10 a.m. How long was the flight? The key to the solution is that the change to DST occurred during the flight—the clocks sprang forward. With time zone−aware timestamps, you get 20 minutes, which is a plausible answer for a short flight across the Bay. We get the wrong answer if we don’t use time zone−aware timestamps:

SELECT '2012-03-11 3:10 AM America/Los_Angeles'::timestamptz
 - '2012-03-11 1:50 AM America/Los_Angeles'::timestamptz;

gives you 20 minutes, whereas:

SELECT '2012-03-11 3:10 AM'::timestamp - '2012-03-11 1:50 AM'::timestamp;

gives you 1 hour and 20 minutes.

Let’s drive the point home with more examples, using a Boston server. For Example 5-10, I input my time in Los Angeles local time, but because my server is in Boston, I get a time returned in Boston local time. Note that it does give me the offset but that is merely display information. The timestamp is internally stored in UTC.

Example 5-10. Inputting time in one time zone and output in another
SELECT '2012-02-28 10:00 PM America/Los_Angeles'::timestamptz;
2012-02-29 01:00:00-05

In Example 5-11, we are getting back a timestamp without time zone. So the answer you get when you run this same query will be the same as mine, regardless of where in the world you are.

Example 5-11. Timestamp with time zone to timestamp at location
SELECT '2012-02-28 10:00 PM America/Los_Angeles'::timestamptz 
AT TIME ZONE 'Europe/Paris';
2012-02-29 07:00:00

The query is asking: what time is it in Paris if it’s 2012-02-28 10:00 p.m. in Los Angeles? Note the absence of the UTC offset in the result. Also, notice how you can specify a time zone with its official name rather than just an offset. Visit Wikipedia for a list of official time zone names.

Datetime Operators and Functions

The inclusion of a temporal interval data type greatly eases date and time arithmetic in PostgreSQL. Without it, we’d have to create another family of functions or use a nesting of functions as many other databases do. With intervals, we can add and subtract timestamp data simply by using the arithmetic operators we’re intimately familiar with. The following examples demonstrate operators and functions used with date and time data types.

The addition operator (+) adds an interval to a timestamp:

SELECT '2012-02-10 11:00 PM'::timestamp + interval '1 hour';
2012-02-11 00:00:00

You can also add intervals:

SELECT '23 hours 20 minutes'::interval + '1 hour'::interval;
24:20:00

The subtraction operator (-) subtracts an interval from a temporal type:

SELECT '2012-02-10 11:00 PM'::timestamptz - interval '1 hour';
2012-02-10 22:00:00-05
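
Subtracting one timestamp from another yields an interval, so computing durations needs no special function:

SELECT '2012-02-14 18:08'::timestamp - '2012-02-10 11:00 PM'::timestamp;
3 days 19:08:00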

OVERLAPS, demonstrated in Example 5-12, returns true if two temporal ranges overlap. This is an ANSI SQL predicate equivalent to the overlaps function. OVERLAPS takes four parameters, the first pair constituting one range and the last pair constituting the other range. An overlap considers the time periods to be half open, meaning that the start time is included but the end time is outside the range. This is slightly different behavior from the common BETWEEN predicate, which considers both start and end to be included. This quirk won’t make a difference unless one of your ranges is a fixed point in time (a period for which start and end are identical). Watch out for this if you’re an avid user of the OVERLAPS function.

Example 5-12. OVERLAPS for timestamp and date
SELECT
	('2012-10-25 10:00 AM'::timestamp, '2012-10-25 2:00 PM'::timestamp)
	OVERLAPS
	('2012-10-25 11:00 AM'::timestamp,'2012-10-26 2:00 PM'::timestamp) AS x,
	('2012-10-25'::date,'2012-10-26'::date)
	OVERLAPS
	('2012-10-26'::date,'2012-10-27'::date) As y;
x  |y
---+---
t  |f

In addition to operators and predicates, PostgreSQL comes with functions supporting temporal types. A full listing can be found at Datetime Functions and Operators. We’ll demonstrate a sampling here.

Once again, we start with the versatile generate_series function. You can use this function with temporal types and interval steps.

As you can see in Example 5-13, we can express dates in our local datetime format or the more global ISO yyyy-mm-dd format. PostgreSQL automatically interprets differing input formats. To be safe, we tend to stick with entering dates in ISO, because date formats vary from culture to culture, server to server, or even database to database.

Example 5-13. Generate time series using generate_series()
SELECT (dt - interval '1 day')::date As eom
FROM generate_series('2/1/2012', '6/30/2012', interval '1 month') As dt;
eom
------------
2012-01-31
2012-02-29
2012-03-31
2012-04-30
2012-05-31

Another popular activity is to extract or format parts of a datetime value. Here, the functions date_part and to_char fit the bill. Example 5-14 also drives home the behavior of DST for a time zone−aware data type. We intentionally chose a period that crosses a daylight saving switchover in US/Eastern. Because the clock springs forward at 2 a.m., the final row of the results reflects the new time.

Example 5-14. Extracting elements of a datetime value
SELECT dt, date_part('hour',dt) As hr, to_char(dt,'HH12:MI AM') As mn
FROM
generate_series(
	'2012-03-11 12:30 AM',
	'2012-03-11 3:00 AM',
	interval '15 minutes'
) As dt;
dt                     | hr | mn
-----------------------+----+----------
2012-03-11 00:30:00-05 |  0 | 12:30 AM
2012-03-11 00:45:00-05 |  0 | 12:45 AM
2012-03-11 01:00:00-05 |  1 | 01:00 AM
2012-03-11 01:15:00-05 |  1 | 01:15 AM
2012-03-11 01:30:00-05 |  1 | 01:30 AM
2012-03-11 01:45:00-05 |  1 | 01:45 AM
2012-03-11 03:00:00-04 |  3 | 03:00 AM

By default, generate_series assumes timestamptz if you don’t explicitly cast values to timestamp.

Arrays

Arrays play an important role in PostgreSQL. They are particularly useful in building aggregate functions, forming IN and ANY clauses, and holding intermediary values for morphing to other data types. In PostgreSQL, every data type has a companion array type. If you define your own data type, PostgreSQL creates a corresponding array type in the background for you. For example, integer has an integer array type integer[], character has a character array type character[], and so forth. We’ll show you some useful functions to construct arrays short of typing them in manually. We will then point out some handy functions for array manipulations. You can get the complete listing of array functions and operators in the Official Manual: Array Functions and Operators.

Array Constructors

The most rudimentary way to create an array is to type the elements:

SELECT ARRAY[2001, 2002, 2003] As yrs;

If the elements of your array can be extracted from a query, you can use the more sophisticated constructor function, array():

SELECT array(
SELECT DISTINCT date_part('year', log_ts)
FROM logs
ORDER BY date_part('year', log_ts)
);

Although the array function has to be used with a query returning a single column, you can specify a composite type as the output, thereby achieving multicolumn results. We demonstrate this in “Custom and Composite Data Types”.

You can cast a string representation of an array to an array with syntax of the form:

SELECT '{Alex,Sonia}'::text[] As name, '{46,43}'::smallint[] As age;
name         | age
-------------+--------
{Alex,Sonia} | {46,43}

You can convert delimited strings to an array with the string_to_array function, as demonstrated in Example 5-15.

Example 5-15. Converting a delimited string to an array
SELECT string_to_array('CA.MA.TX', '.') As estados;
estados
----------
{CA,MA,TX}
(1 row)

array_agg is an aggregate function that can take a set of any data type and convert it to an array, as demonstrated in Example 5-16.

Example 5-16. Using array_agg
SELECT array_agg(log_ts ORDER BY log_ts) As x
FROM logs
WHERE log_ts BETWEEN '2011-01-01'::timestamptz AND '2011-01-15'::timestamptz;
x
------------------------------------------
{'2011-01-01', '2011-01-13', '2011-01-14'}

PostgreSQL 9.5 introduced array_agg support for array inputs. In prior versions, aggregating rows of arrays with array_agg raised an error. This support makes it much easier to build multidimensional arrays from one-dimensional arrays, as shown in Example 5-17.

Example 5-17. Creating multidimensional arrays from one-dimensional arrays
SELECT array_agg(f.t)
 FROM ( VALUES ('{Alex,Sonia}'::text[]),
    ('{46,43}'::text[] ) ) As f(t);
array_agg
----------------------
{{Alex,Sonia},{46,43}}
(1 row)

In order to aggregate arrays, they must be of the same data type and the same dimension. To force that in Example 5-17, we cast the ages to text. We also have the same number of items in the arrays being aggregated: two people and two ages. Arrays with the same number of elements are called balanced arrays.

Unnesting Arrays to Rows

A common function used with arrays is unnest, which allows you to expand the elements of an array into a set of rows, as demonstrated in Example 5-18.

Example 5-18. Expanding arrays with unnest
SELECT unnest('{XOX,OXO,XOX}'::char(3)[]) As tic_tac_toe;
tic_tac_toe
---
XOX
OXO
XOX

Although you can add multiple unnests to a single SELECT, if the number of resultant rows from each array is not balanced, you may get some head-scratching results.

A balanced unnest, as shown in Example 5-19, yields three rows.

Example 5-19. Unnesting balanced arrays
SELECT
unnest('{three,blind,mice}'::text[]) As t,
unnest('{1,2,3}'::smallint[]) As i;
t     |i
------+-
three |1
blind |2
mice  |3

If you remove an element of one array so that you don’t have an equal number of elements in both, you get the result shown in Example 5-20.

Example 5-20. Unnesting unbalanced arrays
SELECT
unnest( '{blind,mouse}'::varchar[]) AS v,
unnest('{1,2,3}'::smallint[]) AS i;
v     |i
------+-
blind |1
mouse |2
blind |3
mouse |1
blind |2
mouse |3

Version 9.4 introduced a multiargument unnest function that puts in null placeholders where the arrays are not balanced. The main drawback with the new unnest is that it can appear only in the FROM clause. Example 5-21 revisits our unbalanced arrays using the version 9.4 construct.

Example 5-21. Unnesting unbalanced arrays with multiargument unnest
SELECT * FROM unnest('{blind,mouse}'::text[], '{1,2,3}'::int[]) AS f(t,i);
t      | i
-------+---
blind  | 1
mouse  | 2
<NULL> | 3

Array Slicing and Splicing

PostgreSQL also supports array slicing using the start:end syntax. It returns another array that is a subarray of the original. For example, to return new arrays that just contain elements 2 through 4 of each original array, type:

SELECT fact_subcats[2:4] FROM census.lu_fact_types;

To glue two arrays together end to end, use the concatenation operator ||:

SELECT fact_subcats[1:2] || fact_subcats[3:4] FROM census.lu_fact_types;

You can also add additional elements to an existing array as follows:

SELECT '{1,2,3}'::integer[] || 4 || 5;

The result is {1,2,3,4,5}.

Referencing Elements in an Array

Elements in arrays are most commonly referenced using the index of the element. PostgreSQL array indexes start at 1. If you try to access an element above the upper bound, you won’t get an error—only NULL will be returned. The next example grabs the first and last element of our array column:

SELECT
    fact_subcats[1] AS primero,
    fact_subcats[array_upper(fact_subcats, 1)] As ultimo
FROM census.lu_fact_types;

We used the array_upper function to get the upper bound of the array. The second required parameter of the function indicates the dimension. In our case, our array is one-dimensional, but PostgreSQL does support multidimensional arrays.
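
For a multidimensional array, chain one subscript per dimension. A quick sketch:

SELECT ('{{1,2,3},{4,5,6}}'::int[])[2][1] As x;
x
---
4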

Array Containment Checks

PostgreSQL has several operators for working with array data. We already saw the concatenation operator (||) for combining multiple arrays into one or adding an element to an array in “Array Slicing and Splicing”. Arrays also support the following comparison operators: =, <>, <, >, @>, <@, and &&. These operators require both sides of the operator to be arrays of the same array data type. If you have a GiST or GIN index on your array column, the comparison operators can utilize them.
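
For instance, to give the comparison operators in the examples that follow an index to work with, you might add one like this (the index name is ours):

CREATE INDEX ix_lu_fact_types_fact_subcats ON census.lu_fact_types
USING gin (fact_subcats);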

The overlap operator (&&) returns true if two arrays have any elements in common. Example 5-22 will list all records in our table where the fact_subcats contains elements OCCUPANCY STATUS or For rent.

Example 5-22. Array overlaps operator
SELECT fact_subcats
FROM census.lu_fact_types
WHERE fact_subcats && '{OCCUPANCY STATUS,For rent}'::varchar[];
fact_subcats
-----------------------------------------------------------
{S01,"OCCUPANCY STATUS","Total housing units"...}
{S02,"OCCUPANCY STATUS","Total housing units"...}
{S03,"OCCUPANCY STATUS","Total housing units"...}
{S10,"VACANCY STATUS","Vacant housing units","For rent"...}
(4 rows)

The equality operator (=) returns true only if elements in all the arrays are equal and in the same order. If you don’t care about the order of elements, and just need to know whether all the elements in one array appear as a subset of the other array, use the containment operators (@>, <@). Example 5-23 demonstrates the difference between the contains (@>) and contained by (<@) operators.

Example 5-23. Array containment operators
SELECT '{1,2,3}'::int[] @> '{3,2}'::int[] AS contains;
contains
--------
t
(1 row)
SELECT '{1,2,3}'::int[] <@ '{3,2}'::int[] AS contained_by;
contained_by
------------
f
(1 row)

Range Types

Range data types represent data with a beginning and an end. PostgreSQL also rolled out many operators and functions to identify overlapping ranges, check to see whether a value falls inside the range, and combine adjacent smaller ranges into larger ranges. Prior to range types, we had to kludge our own functions. These often were clumsy and slow, and didn’t always produce the expected results. We’ve been so happy with ranges that we’ve converted all of our temporal tables to use them where possible. We hope you share our joy.

Range types replace the need to use two separate fields to represent ranges. Suppose we want all integers between −2 and 2, but not including 2. The range representation would be [-2,2). The square bracket indicates a range that is closed on that end, whereas a parenthesis indicates a range that is open on that end. Thus, [-2,2) includes exactly four integers: −2, −1, 0, 1. Similarly:

  • The range (-2,2] includes four integers: -1, 0, 1, 2.

  • The range (-2,2) includes three integers: -1, 0, 1.

  • The range [-2,2] includes five integers: -2, -1, 0, 1, 2.

Discrete Versus Continuous Ranges

PostgreSQL makes a distinction between discrete and continuous ranges. A range of integers or dates is discrete because you can enumerate each value within the range. Think of dots on a number line. A range of numerics or timestamps is continuous, because an infinite number of values lies between the end points.

A discrete range has multiple representations. Our earlier example of [-2,2) can be represented in the following ways and still include the same number of values in the range: [-2,1], (-3,1], (-3,2), [-2,2). Of these four representations, the one with [) is considered the canonical form. There’s nothing magical about closed-open ranges except that if everyone agrees to using that representation for discrete ranges, we can easily compare among many ranges without having to worry first about converting open to close or vice versa. PostgreSQL canonicalizes all discrete ranges, for both storage and display. So if you enter a date range as (2014-1-5,2014-2-1], PostgreSQL rewrites it as [2014-01-06,2014-02-02).
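
You can verify the canonicalization with a simple cast:

SELECT '(2014-1-5,2014-2-1]'::daterange;
[2014-01-06,2014-02-02)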

Built-in Range Types

PostgreSQL comes with six built-in range types for numbers and datetimes:

int4range, int8range

A range of integers. Integer ranges are discrete and subject to canonicalization.

numrange

A continuous range of decimals, floating-point numbers, or double-precision numbers.

daterange

A discrete date range of calendar dates without time zone awareness.

tsrange, tstzrange

A continuous date and time (timestamp) range allowing for fractional seconds. tsrange is not time zone−aware; tstzrange is time zone−aware.

For number-like ranges, if either the start point or the end point is left blank, PostgreSQL replaces it with a null. For practicality, you can interpret the null to represent either -infinity on the left or infinity on the right. In actuality, you’re bound by the smallest and largest values for the particular data type. So an int4range of (,) would be [-2147483648,2147483647).

For temporal ranges, -infinity and infinity are valid upper and lower bounds.

In addition to the built-in range types, you can create your own range types. When you do, you can set the range to be either discrete or continuous.
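
A minimal sketch of a continuous custom range, mirroring the example in the PostgreSQL manual:

CREATE TYPE floatrange AS RANGE (
    subtype = float8,
    subtype_diff = float8mi
);
SELECT '[1.5,2.5)'::floatrange;
[1.5,2.5)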

Defining Ranges

A range, regardless of type, is always composed of two elements of the same type with the bounding condition denoted by brackets or parentheses, as shown in Example 5-24.

Example 5-24. Defining ranges with casts
SELECT '[2013-01-05,2013-08-13]'::daterange; 1
SELECT '(2013-01-05,2013-08-13]'::daterange; 2
SELECT '(0,)'::int8range; 3
SELECT '(2013-01-05 10:00,2013-08-13 14:00]'::tsrange; 4
[2013-01-05,2013-08-14)
[2013-01-06,2013-08-14)
[1,)
("2013-01-05 10:00:00","2013-08-13 14:00:00"]
1

A date range between 2013-01-05 and 2013-08-13 inclusive. Note the canonicalization on the upper bound.

2

A date range greater than 2013-01-05 and less than or equal to 2013-08-13. Notice the canonicalization.

3

All integers greater than 0. Note the canonicalization.

4

A timestamp greater than 2013-01-05 10:00 AM and less than or equal to 2013-08-13 2 PM.

Tip

Datetimes in PostgreSQL can take on the values of -infinity and infinity. For uniformity and in keeping with convention, we suggest that you always use [ for the former and ) for the latter as in [-infinity, infinity).

Ranges can also be defined using range constructor functions, which go by the same name as the range and can take two or three arguments. Here’s an example:

SELECT daterange('2013-01-05','infinity','[]');

The third argument denotes the bounds. If omitted, the closed-open [) convention is used by default. We suggest that you always include the third argument for clarity.

Defining Tables with Ranges

Temporal ranges are popular. Suppose you have an employment table that stores employment history. Instead of creating separate columns for start and end dates, you can design a table as shown in Example 5-25. In the example, we added an index to the period column to speed up queries using our range column.

Example 5-25. Table with date range
CREATE TABLE employment (id serial PRIMARY KEY, employee varchar(20), 
period daterange);
CREATE INDEX ix_employment_period ON employment USING gist (period); 1
INSERT INTO employment (employee,period)
VALUES
	('Alex','[2012-04-24, infinity)'::daterange),
	('Sonia','[2011-04-24, 2012-06-01)'::daterange),
	('Leo','[2012-06-20, 2013-04-20)'::daterange),
	('Regina','[2012-06-20, 2013-04-20)'::daterange);
1

Add a GiST index on the range field.

Range Operators

Two range operators tend to be used most often: overlap (&&) and contains (@>). Those are the ones we’ll cover. To see the full catalog of range operators, go to Range Operators.

Overlap operator

As the name suggests, the overlap operator && returns true if two ranges have any values in common. Example 5-26 demonstrates this operator and puts to use the string_agg function for aggregating the list of employees into a single text field.

Example 5-26. Who worked with whom?
SELECT
	e1.employee,
	string_agg(DISTINCT e2.employee, ', ' ORDER BY e2.employee) As colleagues
FROM employment As e1 INNER JOIN employment As e2
ON e1.period && e2.period
WHERE e1.employee <> e2.employee
GROUP BY e1.employee;
employee | colleagues
---------+-------------------
Alex     | Leo, Regina, Sonia
Leo      | Alex, Regina
Regina   | Alex, Leo
Sonia    | Alex

Contains and contained in operators

In the contains operator (@>), the first argument is a range and the second is a value. If the second is within the first, the contains operator returns true. Example 5-27 demonstrates its use.

Example 5-27. Who is currently working?
SELECT employee FROM employment WHERE period @> CURRENT_DATE GROUP BY employee;
employee
--------
Alex

The reverse of the contains operator is the contained operator (<@), whose first argument is the value and the second the range.
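
For example, checking a date against Alex’s employment period from Example 5-25:

SELECT '2012-05-01'::date <@ '[2012-04-24,infinity)'::daterange As hired_then;
hired_then
----------
t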

JSON

PostgreSQL provides JSON (JavaScript Object Notation) and many support functions. JSON has become the most popular data interchange format for web applications. Version 9.3 significantly beefed up JSON support with new functions for extracting, editing, and casting to other data types. Version 9.4 introduced the JSONB data type, a binary form of JSON that can also take advantage of indexes. Version 9.5 introduced more functions for jsonb, including functions for setting elements in a jsonb object. Version 9.6 introduced the jsonb_insert function for inserting elements into an existing jsonb array or adding a new key value.

Inserting JSON Data

To create a table to store JSON, define a column as a json type:

CREATE TABLE persons (id serial PRIMARY KEY, person json);

Example 5-28 inserts JSON data. PostgreSQL automatically validates the input to make sure what you are adding is valid JSON. Remember that you can’t store invalid JSON in a JSON column, nor can you cast invalid JSON to a JSON data type.

Example 5-28. Populating a JSON field
INSERT INTO persons (person)
VALUES (
    '{
        "name":"Sonia",
        "spouse":
        {
            "name":"Alex",
            "parents":
            {
                "father":"Rafael",
                "mother":"Ofelia"
            },
            "phones":
            [
                {
                    "type":"work",
                    "number":"619-722-6719"
                },
                {
                    "type":"cell",
                    "number":"619-852-5083"
                }
            ]
        },
        "children":
        [
            {
                "name":"Brandon",
                "gender":"M"
            },
            {
                "name":"Azaleah",
                "girl":true,
                "phones": []
            }
        ]
    }'
);

Querying JSON

The easiest way to traverse the hierarchy of a JSON object is by using pointer symbols. Example 5-29 shows some common usage.

Example 5-29. Querying the JSON field
SELECT person->'name' FROM persons;
SELECT person->'spouse'->'parents'->'father' FROM persons;

You can also write the query using a path array as in the following example:

SELECT person#>array['spouse','parents','father'] FROM persons;

Notice that you must use the #> pointer symbol if what comes after is a path array.

To penetrate JSON arrays, specify the array index. JSON arrays are zero-indexed, unlike PostgreSQL arrays, whose indexes start at 1.

SELECT person->'children'->0->'name' FROM persons;

And the path array equivalent:

SELECT person#>array['children','0','name'] FROM persons;

All queries in the prior examples return the value as JSON primitives (numbers, strings, booleans). To return the text representation, add another greater-than sign as in the following examples:

SELECT person->'spouse'->'parents'->>'father' FROM persons;
SELECT person#>>array['children','0','name'] FROM persons;

If you are chaining the -> operator, only the very last one can be a ->> operator.

The json_array_elements function takes a JSON array and returns each element of the array as a separate row as in Example 5-30.

Example 5-30. json_array_elements to expand JSON array
SELECT json_array_elements(person->'children')->>'name' As name FROM persons;
name
-------
Brandon
Azaleah
(2 rows)
Note

We strongly encourage you to use pointer symbols when drilling down into a JSON object. The syntax is more succinct and you can use the same operators as for JSONB (which we’ll cover shortly). PostgreSQL does offer functional equivalents if you need them: json_extract_path is a variadic function (functions with an unlimited number of arguments). The first argument is always the JSON object you are trying to navigate; subsequent parameters are the key value for each tier of the hierarchy. The equivalent to ->> and #>> is json_extract_path_text.
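
For instance, both of the following return Rafael:

SELECT person#>>array['spouse','parents','father'] As x1,
    json_extract_path_text(person, 'spouse', 'parents', 'father') As x2
FROM persons;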

Outputting JSON

In addition to querying JSON data, you can convert other data to JSON. In these next examples, we’ll demonstrate the use of JSON built-in functions to create JSON objects.

Example 5-31 demonstrates the use of row_to_json to convert a subset of columns in each record from the table we created and loaded in Example 5-28.

Example 5-31. Converting rows to individual JSON objects (requires version 9.3 or later)
SELECT row_to_json(f) As x
FROM (
    SELECT id, json_array_elements(person->'children')->>'name' As cname FROM persons
) As f;
                x
--------------------------
{"id":1,"cname":"Brandon"}
{"id":1,"cname":"Azaleah"}
(2 rows)

To output each row in our persons table as JSON:

SELECT row_to_json(f) As jsoned_row FROM persons As f;

The use of a row as an output field in a query is a feature unique to PostgreSQL. It’s handy for creating complex JSON objects. We describe it further in “Composite Types in Queries”, and Example 7-20 demonstrates the use of array_agg and array_to_json to output a set of rows as a single JSON object. In version 9.3 we have at our disposal the json_agg function. We demonstrate its use in Example 7-21.

Binary JSON: jsonb

New in PostgreSQL 9.4 is the jsonb data type. It is handled through the same operators as those for the json type, and similarly named functions, plus several additional ones. jsonb performance is much better than json performance because jsonb doesn’t need to be reparsed during operations. There are a couple of key differences between the jsonb and json data types:

  • jsonb is internally stored as a binary object and does not maintain the formatting of the original JSON text as the json data type does. Spaces aren’t preserved, numbers can appear slightly different, and attributes become sorted. For example, a number input as 1e-5 would be converted to its decimal representation, 0.00001.

  • jsonb does not allow duplicate keys and silently picks one, whereas the json type preserves duplicates. This is demonstrated in Michael Paquier’s article “Manipulating jsonb data by abusing of key uniqueness”.

  • jsonb columns can be directly indexed using the GIN index method (covered in “Indexes”), whereas json requires a functional index to extract key elements.
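
For example, with the persons table from Example 5-28, such a functional index on a single key might look like the following (the index name is ours); we’ll build the GIN counterpart for jsonb later in this section:

CREATE INDEX ix_persons_person_name ON persons ((person->>'name'));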

To demonstrate these concepts, we’ll create another persons table, replacing the json column with a jsonb:

CREATE TABLE persons_b (id serial PRIMARY KEY, person jsonb);

To insert data into our new table, we would repeat Example 5-28, substituting persons_b for persons.

So far, working with JSON and binary JSON has been the same. Differences appear when you query. To make the binary JSON readable, PostgreSQL converts it to a canonical text representation, as shown in Example 5-32.

Example 5-32. jsonb versus json output
SELECT person As b FROM persons_b WHERE id = 1; 1
SELECT person As j FROM persons WHERE id = 1;2
b
---------------------------------------------------------------------------------
{"name": "Sonia",
"spouse": {"name": "Alex", "phones": [{"type": "work", "number": "619-722-6719"},
{"type": "cell", "number": "619-852-5083"}],
"parents": {"father": "Rafael", "mother": "Ofelia"}},
"children": [{"name": "Brandon", "gender": "M"},
 {"girl": true, "name": "Azaleah", "phones": []}]}
(1 row)
                       j
---------------------------------------------
 {
    "name":"Sonia",
         "spouse":
         {
             "name":"Alex",
             "parents":
             {
                 "father":"Rafael",
                 "mother":"Ofelia"
             },
             "phones":
             [
                 {
                     "type":"work",
                     "number":"619-722-6719"+
                 },
                 {
                     "type":"cell",
                     "number":"619-852-5083"+
                 }
             ]
         },
         "children":
         [
             {
                 "name":"Brandon",
                 "gender":"M"
             },
             {
                 "name":"Azaleah",
                 "girl":true,
                 "phones": []
             }
         ]
     }
(1 row)
1

jsonb reformats input and removes whitespace. Also, the order of attributes is not maintained from the insert.

2

json maintains input whitespace and the order of attributes.

jsonb has similarly named functions as json, plus some additional ones. So, for example, the json family of functions such as json_extract_path_text and json_each are matched in jsonb by jsonb_extract_path_text, jsonb_each, etc. However, the equivalent operators are the same, so you will find that the examples in “Querying JSON” work largely the same without change for the jsonb type—just replace the table name and json_array_elements with jsonb_array_elements.

In addition to the operators supported by json, jsonb has additional comparator operators for equality (=), contains (@>), contained (<@), key exists (?), any of array of keys exists (?|), and all of array of keys exists (?&).

So, for example, to list all people that have a child named Brandon, use the contains operator as demonstrated in Example 5-33.

Example 5-33. jsonb contains operator
SELECT person->>'name' As name
FROM persons_b
WHERE person @> '{"children":[{"name":"Brandon"}]}';
name
-----
Sonia

These additional operators provide very fast checks when you complement them with a GIN index on the jsonb column:

CREATE INDEX ix_persons_jb_person_gin ON persons_b USING gin (person);

We don’t have enough records in our puny table for the index to kick in, but for more rows, you’d see that Example 5-33 utilizes the index.

Editing JSONB data

PostgreSQL 9.5 introduced native jsonb concatenation (||) and subtraction operators (-, #-) as well as companion functions for setting data. These operators do not exist for the json datatype. To be able to accomplish these tasks in prior versions, you’d have to lean on “Writing PL/V8, PL/CoffeeScript, and PL/LiveScript Functions” to do the work.

The concatenation operator can be used to add and replace attributes of a jsonb object. In Example 5-34 we add an address attribute to the Gomez family and use the RETURNING construct covered in “Returning Affected Records to the User” to return the updated value. The new value has an address attribute.

Example 5-34. Using JSONB || to add address
UPDATE persons_b
SET person = person || '{"address": "Somewhere in San Diego, CA"}'::jsonb
WHERE person @> '{"name":"Sonia"}'
RETURNING person;
person
-------------------------------------------------------------------------------
{"name": "Sonia", ... "address": "Somewhere in San Diego, CA", "children": ...}
(1 row)
UPDATE 1

Because JSONB requires that keys be unique, if you try to add a duplicate key, the original value will be replaced instead. So to update with a new address, we would repeat the exercise in Example 5-34, but replacing Somewhere in San Diego, CA with something else.

If we decided we no longer wanted an address, we could use the - operator, as shown in Example 5-35.

Example 5-35. Using JSONB - to remove an element
UPDATE persons_b
SET person = person - 'address'
WHERE person @> '{"name":"Sonia"}';

The simple - operator works for first-level elements, but what if you wanted to remove an attribute from a particular member? This is when you’d use the #- operator. #- takes an array of text values that denotes the path of the element you want to remove. In Example 5-36 we remove the girl designator of Azaleah.

Example 5-36. Using JSONB #- to remove nested element
UPDATE persons_b
SET  person = person #- '{children,1,girl}'::text[]
WHERE person @> '{"name":"Sonia"}'
RETURNING person->'children'->1;
{"name": "Azaleah", "phones": []}

When removing elements from an array, you need to denote the index. Because JavaScript indexes start at 0, to remove an element from the second child, we use 1 instead of 2. If we wanted to remove Azaleah entirely, we would have used '{children,1}'::text[].

To add a gender attribute, or replace one that was previously set, we can use the jsonb_set function as shown in Example 5-37.

Example 5-37. Using the jsonb_set function to change a nested value
UPDATE persons_b
SET person = jsonb_set(person,'{children,1,gender}'::text[],'"F"'::jsonb, true)
WHERE person @> '{"name":"Sonia"}';

jsonb_set takes four arguments of the form jsonb_set(jsonb_to_update, text_array_path, new_jsonb_value, allow_creation). If allow_creation is false, a property that does not already exist will not be created; the original value is returned unchanged.
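
The jsonb_insert function mentioned earlier (new in version 9.6) has a similar shape, but instead of overwriting, it pushes existing array elements aside (the sample value is ours):

SELECT jsonb_insert('{"scores":[100,95]}', '{scores,1}', '98');
{"scores": [100, 98, 95]}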

XML

The XML data type, similar to JSON, is “controversial” in a relational database because it violates the principles of normalization. Nonetheless, all of the high-end relational database products (IBM DB2, Oracle, SQL Server) support XML. PostgreSQL also jumped on the bandwagon and offers plenty of functions to boot. (We’ve authored many articles on working with XML in PostgreSQL.) PostgreSQL comes packaged with functions for generating, manipulating, and parsing XML data. These are outlined in XML Functions. Unlike the jsonb type, there is currently no direct index support for it. So you need to use functional indexes to index subparts, similar to what you can do with the plain json type.

Inserting XML Data

When you create a column of the xml data type, PostgreSQL automatically ensures that only valid XML values populate the rows. This is what distinguishes an XML column from just any text column. However, the XML is not validated against any Document Type Definition (DTD) or XML Schema Definition (XSD), even if one is specified in the XML document. Example 5-38 shows how to add XML data to a table by declaring a column as xml and inserting into it as usual.

Example 5-38. Populate an XML field
CREATE TABLE families (id serial PRIMARY KEY, profile xml);
INSERT INTO families(profile)
VALUES (
    '<family name="Gomez">
        <member><relation>padre</relation><name>Alex</name></member>
        <member><relation>madre</relation><name>Sonia</name></member>
        <member><relation>hijo</relation><name>Brandon</name></member>
        <member><relation>hija</relation><name>Azaleah</name></member>
	</family>');

Each XML value could have a different XML structure. To enforce uniformity, you can add a check constraint, covered in “Check Constraints”, to the XML column. Example 5-39 ensures that every family has at least one relation element. The '/family/member/relation' is XPath syntax, a basic way to refer to elements and other parts of XML.

Example 5-39. Ensure that all records have at least one member relation
ALTER TABLE families ADD CONSTRAINT chk_has_relation
CHECK (xpath_exists('/family/member/relation', profile));

If we then try to insert something like:

INSERT INTO families (profile) VALUES ('<family name="HsuObe"></family>');

we will get this error: ERROR: new row for relation "families" violates check constraint "chk_has_relation".

For more involved checks that require checking against DTD or XSD, you’ll need to resort to writing functions and using those in the check constraint, because PostgreSQL doesn’t have built-in functions to handle those kinds of checks.

Querying XML Data

To query XML, the xpath function is really useful. The first argument is an XPath query, and the second is an xml object. The output is an array of XML elements that satisfies the XPath query. Example 5-40 combines xpath with unnest to return all the family members. unnest unravels the array into a row set. We then cast the XML fragment to text.

Example 5-40. Query XML field
SELECT ordinality AS id, family,
    (xpath('/member/relation/text()', f))[1]::text As relation,
    (xpath('/member/name/text()', f))[1]::text As mem_name 1
FROM (
    SELECT
        (xpath('/family/@name', profile))[1]::text As family, 2
        f.ordinality, f.f
        FROM families, unnest(xpath('/family/member', profile)) WITH ORDINALITY AS f
) x; 3
 id | family | relation | mem_name
----+--------+----------+----------
  1 | Gomez  | padre    | Alex
  2 | Gomez  | madre    | Sonia
  3 | Gomez  | hijo     | Brandon
  4 | Gomez  | hija     | Azaleah
(4 rows)
1

Get the text element in the relation and name tags of each member element. We need to use array subscripting because xpath always returns an array, even if only one element is returned.

2

Get the name attribute from family root. For this we use @attribute_name.

3

Break the result of the SELECT into one row per <member> element. The slash is a way of getting at subtag elements. For example, xpath('/family/member', profile) returns an array of all members in each family defined in a profile. The @ sign is used to select attributes of an element. So, for example, family/@name returns the name attribute of a family. By default, xpath always returns an element, including the tag part. The text() forces a return of just the text body of an element.

New in version 10 is the ANSI-SQL standard XMLTABLE construct. XMLTABLE converts text of XML into individual rows and columns based on some defined transformation. We’ll repeat Example 5-40 using XMLTABLE.

Example 5-41. Query XML using XMLTABLE
SELECT xt.*
  FROM families,
       XMLTABLE ('/family/member' PASSING profile  1
                 COLUMNS 2
                    id FOR ORDINALITY 3,
                    family text PATH '../@name' 4,
                    relation text NOT NULL 5,
                    mem_name text PATH 'name' NOT NULL
                ) AS xt;
 id | family | relation | mem_name
----+--------+----------+----------
  1 | Gomez  | padre    | Alex
  2 | Gomez  | madre    | Sonia
  3 | Gomez  | hijo     | Brandon
  4 | Gomez  | hija     | Azaleah
(4 rows)
1

The first part is an XML path element that defines the row. The word PASSING is followed by the table column to parse out rows. This column has to be of type xml. We use the families.profile column of our families table.

2

The COLUMNS component should define the list of columns to be parsed out of the xml.

3

Similar to WITH ORDINALITY in conjunction with set-returning functions, you can use FOR ORDINALITY to assign numeric order to each record.

4

You can use ../ to move up a level above the base of the row. In this case we use ../@name to get the family name, which is one level above family/member. The @ is used to denote that this is an attribute (something of the form name='a value') and not an element.

5

If a path element matches the name of your defined column, you don’t need to specify the PATH. In this case, because /family/member/relation matches our column name relation, we can skip the PATH clause.

Full Text Search

I’m sure you’ve seen websites where you can search by typing in keywords. An ecommerce site will bring up a list of matching products; a film site will bring up a list of matching movies; a knowledgebase site will bring up matching questions and answers, etc.

To search textual data by keywords, you have at your disposal the like and ilike (case-insensitive) operators. You can also avail yourself of powerful regular expression and Soundex searches. But all of these methods stop short of offering natural language−based match conditions. For example, if you’re looking for LGBT movies and type that abbreviation into your search, you’re going to miss movies described as lesbian, gay, bisexual, or transgendered. If you type in the search term lots of steamy sex scenes, you may end up with nothing unless the description very closely matches what you typed in.

FTS is a suite of tools that adds a modicum of “intelligence” to your searches. Though it’s far from being able to read your mind, it can find words that are close in meaning, rather than spelling. FTS is packaged into PostgreSQL, with no additional installation necessary.

At the core of FTS is an FTS configuration. The configuration codifies the rules under which a match occurs by referring to one or more dictionaries. For instance, if your dictionary contains entries that equate the words love, romance, infatuation, and lust, then a search by any one of those words will find matches containing any of the others. Dictionaries may also equate words with the same stem. For example, love, loving, and loved share a common stem. A dictionary could equate all principal parts of a verb; for example, eat, eats, ate, and eaten could be considered the same.

A dictionary can also list stop words. These are usually parts of speech that add little to the meaning. Articles, conjunctions, prepositions, and pronouns such as a, the, on, and that often make up the list of stop words.

Beyond matching synonyms and pruning stop words, FTS can be used to rank searches. FTS can utilize the proximity of words to each other and the frequency of terms in text to rank search results. For example, if you’re interested in viewing movies where sex is depicted with smoking, you could search for the two words sex and smoking, but also specify that the two words must be two words apart and rank higher if they appear in the title. And so “they smoked after sex” would hit, whereas “sex took place in a hotel, which has a foyer for smoking guests” would miss. FTS can apply unequal weights to the places where the sought-after words appear in the text. For instance, if you have a movie where the word sex appears in either the title or the byline, you could make this movie rank higher than movies where sex is only in the description.

FTS Configurations

Most PostgreSQL distributions come packaged with over 10 FTS configurations. All these are installed in the pg_catalog schema.

To see the listing of installed FTS configurations, run the query SELECT cfgname FROM pg_ts_config;. Or use the \dF command in psql. A typical list follows:

cfgname
----------
simple
danish
dutch
english
finnish
french
german
hungarian
italian
norwegian
portuguese
romanian
russian
spanish
swedish
turkish
(16 rows)

If you need to create your own configurations or dictionaries, refer to PostgreSQL Manual: Full Text Search Configuration and PostgreSQL Manual: Full Text Search Dictionaries.

You’re not limited to built-in FTS configurations. You can create your own. But before you do, you may wish to see what other users have already created that may suit your needs. If your text is medical-related, you may be able to find a configuration with dictionaries chock full of specialized anatomy terms. If your text is in Spanish, find a configuration tailored to your particular dialect of Spanish.

Once you locate a configuration that you’d like added to your arsenal, installation is quite simple and usually doesn’t require additional compilation. We demonstrate by installing the popular hunspell configuration.

Start by downloading hunspell configurations from hunspell_dicts. You’ll find hunspell configurations for many different languages. We’ll go with hunspell_en_us:

  1. Download everything in the folder.

  2. Copy en_us.affix and en_us.dict to your PostgreSQL installation directory share/tsearch_data.

  3. Copy the hunspell_en_us--*.sql and hunspell_en_us.control files to your PostgreSQL installation directory share/extension folder.

Next, run:

CREATE EXTENSION hunspell_en_us SCHEMA pg_catalog;

From psql, if you now run Example 5-42, you’ll see details of the hunspell configuration and dictionary we just installed.

Example 5-42. FTS configuration hunspell
\dF+ english_hunspell
Text search configuration "pg_catalog.english_hunspell"
Parser: "pg_catalog.default"
Token           | Dictionaries
----------------+-------------------------------
asciihword      | english_hunspell,english_stem
asciiword       | english_hunspell,english_stem
email           | simple
file            | simple
float           | simple
host            | simple
hword           | english_hunspell,english_stem
hword_asciipart | english_hunspell,english_stem
hword_numpart   | simple
hword_part      | english_hunspell,english_stem
int             | simple
numhword        | simple
numword         | simple
sfloat          | simple
uint            | simple
url             | simple
url_path        | simple
version         | simple
word            | english_hunspell,english_stem
Warning

Keep in mind that not all FTS configurations install in the same way. Read the instructions.

Contrast that output to the built-in English configuration in Example 5-43, which gives you the dictionaries used by the English configuration.

Example 5-43. FTS English configuration
\dF+ english
Text search configuration "pg_catalog.english"
Parser: "pg_catalog.default"
Token           | Dictionaries
----------------+--------------
asciihword      | english_stem
asciiword       | english_stem
email           | simple
file            | simple
float           | simple
host            | simple
hword           | english_stem
hword_asciipart | english_stem
hword_numpart   | simple
hword_part      | english_stem
int             | simple
numhword        | simple
numword         | simple
sfloat          | simple
uint            | simple
url             | simple
url_path        | simple
version         | simple
word            | english_stem

The only difference between the two is that hunspell draws from an additional dictionary.

Not sure which configuration is the default? Run:

SHOW default_text_search_config;

To replace the default with another, run:

ALTER DATABASE postgresql_book 
SET default_text_search_config = 'pg_catalog.english';

This replacement takes place at the database level, but as with most PostgreSQL configuration settings, you can make the change at the server, user, or session levels.
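
For instance, assuming a role named alice (a hypothetical user), the same setting can be applied at the role level or just for the current session:

-- role level: applies to every new session alice starts
ALTER ROLE alice SET default_text_search_config = 'pg_catalog.english';
-- session level: applies only to the current connection
SET default_text_search_config = 'pg_catalog.english';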

TSVectors

A text column must be vectorized before FTS can search against it. The resultant vector column is a tsvector data type. To create a tsvector from text, you must specify the FTS configuration to use. The vectorization reduces the original text to a set of word skeletons, referred to as lexemes, by stemming words and removing stop words. For each lexeme, the tsvector records where in the original text it appears. Each lexeme is therefore imbued with at least one position, much like a vector in the physical sense; the more frequently a lexeme appears, the higher it can rank in searches.

Use the to_tsvector function to vectorize a blob of text. This function will resort to the default FTS configuration unless you specify another.

Example 5-44 shows how TSVectors differ depending on which FTS configuration was used in their construction.

Example 5-44. TSVector derived from different FTS configurations
SELECT
    c.name,
    CASE
        WHEN c.name ='default' THEN to_tsvector(f.t)
        ELSE to_tsvector(c.name::regconfig,f.t)
    END As vect
FROM (
    SELECT 'Just dancing in the rain. I like to dance.'::text) As f(t), (
        VALUES ('default'),('english'),('english_hunspell'),('simple')
    ) As c(name);
name             | vect
-----------------+-----------------------------------------------------------------------------
default          | 'danc':2,9 'like':7 'rain':5
english          | 'danc':2,9 'like':7 'rain':5
english_hunspell | 'dance':2,9 'dancing':2 'like':7 'rain':5
simple           | 'dance':9 'dancing':2 'i':6 'in':3 'just':1 'like':7 
'rain':5 'the':4 'to':8
(4 rows)

Example 5-44 demonstrates how four different FTS configurations result in different vectors. Note how the English and Hunspell configurations remove all stop words, such as just and to. English and Hunspell also convert words to their normalized form as dictated by their dictionaries, so dancing becomes danc and dance, respectively. The simple configuration has no concept of stemming and stop words.

The to_tsvector function returns where each lexeme appears in the text. So, for example, 'danc':2,9 means that the lexeme danc (derived from dancing and dance) appears as the second and the ninth words.

To incorporate FTS into your database, add a tsvector column to your table. You then either schedule the tsvector column to be updated regularly, or add a trigger to the table so that whenever relevant fields update, the tsvector field recomputes.

For our examples, we gathered fictitious movie data. Load the tables from psql using the film.sql script as follows:

\encoding utf8
\i film.sql

Next, we add and compute a tsvector column to the film table as shown in Example 5-45.

Example 5-45. Add tsvector column and populate with weights
ALTER TABLE film ADD COLUMN fts tsvector;
UPDATE film
SET fts =
    setweight(to_tsvector(COALESCE(title,'')),'A') ||
    setweight(to_tsvector(COALESCE(description,'')),'B');
CREATE INDEX ix_film_fts_gin ON film USING gin (fts);

Example 5-45 vectorizes the title and description columns and stores the vector in a newly created tsvector column. To speed up searches, we add a GIN index on the tsvector column. GIN is a lossless index. You can also add a GiST index on a vector column. GiST is lossy and slower to search but builds quicker and takes up less disk space. We explore indexes in more detail in “Indexes”.
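
For comparison, the GiST alternative is a one-line change (the index name here is our own):

-- lossy and slower to search, but faster to build and smaller on disk
CREATE INDEX ix_film_fts_gist ON film USING gist (fts);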

By populating the fts column, we’ve introduced two new constructs, the setweight function and the concatenation operator (||), to tsvector.

To distinguish the relative importance of different lexemes, you could assign a weight to each. The weights must be A, B, C, or D, with A ranking highest in importance. In Example 5-45, we assigned A to lexemes culled from the title and B to lexemes from the description. If our search term matches a lexeme from the title, we deem the match to be more relevant than a match from the description of the movie.

TSVectors can be formed from other tsvectors using the concatenation (||) operator. We used it here to combine the title and description into a single tsvector. This way when we search, we have to contend with only a single column.

Should data change in one of the basis columns forming the tsvector, you must re-vectorize. To avoid having to manually run to_tsvector every time data changes, create a trigger that responds to updates. In the trigger, use the handy tsvector_update_trigger function as shown in Example 5-46.

Example 5-46. Trigger to automatically update tsvector
CREATE TRIGGER trig_tsv_film_iu
BEFORE INSERT OR UPDATE OF title, description ON film FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger(fts,'pg_catalog.english',
title,description);

Example 5-46 reacts to an insert or update of the title or description by revectorizing the fts column. One shortcoming, though: tsvector_update_trigger does not support weighting.
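
If you need weights, you can write your own trigger function instead. The following minimal sketch (function and trigger names are ours) reproduces the weighting of Example 5-45; create it in place of, not in addition to, the trigger from Example 5-46:

CREATE OR REPLACE FUNCTION film_fts_update() RETURNS trigger AS
$$
BEGIN
    -- recompute the weighted vector from the incoming row
    NEW.fts :=
        setweight(to_tsvector('pg_catalog.english', COALESCE(NEW.title,'')),'A') ||
        setweight(to_tsvector('pg_catalog.english', COALESCE(NEW.description,'')),'B');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trig_tsv_film_weighted_iu
BEFORE INSERT OR UPDATE OF title, description ON film
FOR EACH ROW EXECUTE PROCEDURE film_fts_update();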

TSQueries

A FTS, or any text search for that matter, has two components: the searched text and the search terms. For FTS to work, both must be vectorized. We have already seen how to vectorize the searched text to create tsvector columns. We now show you how to vectorize the search terms.

FTS refers to vectorized search terms as tsqueries, and PostgreSQL offers several functions that will convert plain-text search terms to tsqueries: to_tsquery, plainto_tsquery, and phraseto_tsquery. The last is a new function in 9.6 and takes the ordering of words in the search term into consideration.

tsqueries are normally created on the fly rather than being stored in a table. However, if you are building a system where people can save their queries and run them, you could define a tsquery column in a table.
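
A minimal sketch of such a saved-search table might look like this (table and column names are ours):

CREATE TABLE saved_searches (id serial PRIMARY KEY, label text, query tsquery);
INSERT INTO saved_searches (label, query)
VALUES ('business analytics', plainto_tsquery('english','business analytics'));

-- run a saved search against the film table from Example 5-45
SELECT f.title
FROM film f
JOIN saved_searches s ON f.fts @@ s.query
WHERE s.label = 'business analytics';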

Example 5-47 shows the output using the to_tsquery functions against two configurations: the default English configuration and the Hunspell configuration.

Example 5-47. TSQuery constructions: to_tsquery
SELECT to_tsquery('business & analytics');
to_tsquery
-----------------
'busi' & 'analyt'
SELECT to_tsquery('english_hunspell','business & analytics');
to_tsquery
--------------------------------
('business' | 'busy') & 'analyt'

Both examples are akin to searching for text containing the words business and analytics. The and operator (&) means that both words must appear in the searched text. The or operator (|) means one or both of the words must appear in the searched text. If the configuration in use finds multiple stems for a word, they are stitched together by the or operator.

Warning

You should use the same FTS configuration as the one you used to build the tsvector.

A slight variant of to_tsquery is plainto_tsquery. This function automatically inserts the and operator between words for you, saving you a few keystrokes. See Example 5-48.

Example 5-48. TSQuery constructions: plainto_tsquery
SELECT plainto_tsquery('business analytics');
plainto_tsquery
-----------------
'busi' & 'analyt'

to_tsquery and plainto_tsquery look only at words, not their sequence. So business analytics and analytics business produce the same tsquery. This is a shortcoming because it limits you to word searches rather than phrase searches. Version 9.6 addressed this with the phraseto_tsquery function. In Example 5-49, phraseto_tsquery vectorizes the words and inserts the distance operator (<->) between them. This means that the searched text must contain the words business and analytics in that order, upgrading a word search to a phrase search.

Example 5-49. TSQuery constructions: phraseto_tsquery
SELECT phraseto_tsquery('business analytics');
phraseto_tsquery
-------------------
'busi' <-> 'analyt'
SELECT phraseto_tsquery('english_hunspell','business analytics');
phraseto_tsquery
---------------------------------------------
'business' <-> 'analyt' | 'busy' <-> 'analyt'

You can also cast text to tsquery without using any functions, as in 'business & analytics'::tsquery. However, with casts, words are not replaced with lexemes and are taken literally.
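
You can see the difference by comparing the two side by side; the cast keeps the word verbatim, whereas the function reduces it to a lexeme:

SELECT 'dancing'::tsquery AS cast_q, to_tsquery('english','dancing') AS func_q;
 cast_q    | func_q
-----------+--------
 'dancing' | 'danc'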

TSQueries can be combined using the or operator (||) or the and operator (&&). The expression tsquery1 || tsquery2 means matching text must satisfy either tsquery1 or tsquery2. The expression tsquery1 && tsquery2 means matching text must satisfy both tsquery1 and tsquery2.

Examples of each are shown in Example 5-50.

Example 5-50. Combining tsqueries
SELECT plainto_tsquery('business analyst') || phraseto_tsquery('data scientist');
tsquery
-------------------------------------------
'busi' & 'analyst' | 'data' <-> 'scientist'
SELECT plainto_tsquery('business analyst') && phraseto_tsquery('data scientist');
tsquery
--------------------------------------------
'busi' & 'analyst' & ('data' <-> 'scientist')

tsqueries and tsvectors come with additional operators and functions, such as ones for determining whether one tsquery contains another. All of this is detailed in PostgreSQL Manual: Text Search Functions and Operators.
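
One such operator is the tsquery containment operator @>. The following returns true because every lexeme in the second query also appears in the first:

SELECT to_tsquery('english','business & analytics') @> to_tsquery('english','business');
-- returns t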

Using Full Text Search

We have created a tsvector from our text; we have created a tsquery from our search terms. Now, we can perform an FTS. We do so by using the @@ operator. Example 5-51 demonstrates it.

Example 5-51. FTS in action
SELECT left(title,50) As title, left(description,50) as description
FROM film
WHERE fts @@ to_tsquery('hunter & (scientist | chef)') AND title > '';
title                  | description
-----------------------+---------------------------------------------------
ALASKA PHANTOM         | A Fanciful Saga of a Hunter And a Pastry Chef who
CAUSE DATE             | A Taut Tale of a Explorer And a Pastry Chef who mu
CINCINATTI WHISPERER   | A Brilliant Saga of a Pastry Chef And a Hunter who
COMMANDMENTS EXPRESS   | A Fanciful Saga of a Student And a Mad Scientist w
DAUGHTER MADIGAN       | A Beautiful Tale of a Hunter And a Mad Scientist w
GOLDFINGER SENSIBILITY | A Insightful Drama of a Mad Scientist And a Hunter
HATE HANDICAP          | A Intrepid Reflection of a Mad Scientist And a Pio
INSIDER ARIZONA        | A Astounding Saga of a Mad Scientist And a Hunter
WORDS HUNTER           | A Action-Packed Reflection of a Composer And a Mad
(9 rows)

Example 5-51 finds all films with a title or description containing the word hunter and either the word scientist, or the word chef, or both.

If you are running PostgreSQL 9.6, you can specify the proximity and order of words. See Example 5-52.

Example 5-52. FTS with order and proximity
SELECT left(title,50) As title, left(description,50) as description
FROM film
WHERE fts @@ to_tsquery('hunter <4> (scientist | chef)') AND title > '';
title            | description
-----------------+---------------------------------------------------
ALASKA PHANTOM   | A Fanciful Saga of a Hunter And a Pastry Chef who
DAUGHTER MADIGAN | A Beautiful Tale of a Hunter And a Mad Scientist w
(2 rows)

Example 5-52 requires that the word hunter precede scientist or chef by exactly four words.

Ranking Results

FTS includes functions for ranking results. These functions are ts_rank and ts_rank_cd. ts_rank considers only the frequency of terms and weights, while ts_rank_cd (cd stands for coverage density) also considers the position of the search term within the searched text. If lexemes are found closer together, the result ranks higher. ts_rank_cd is meaningful only if you have position markers in your tsvector; otherwise, it returns zero. The frequency with which a search term appears also depends on position markers. So the ts_rank function will consider only weights if positional markers are missing. By default, ts_rank and ts_rank_cd apply the weights 0.1, 0.2, 0.4, and 1.0, respectively, for D, C, B, and A. Example 5-53 follows the default order.

Example 5-53. Ranking search results
SELECT title, left(description,50) As description,
    ts_rank(fts,ts)::numeric(10,3) AS r
FROM film, to_tsquery('english','love  & (wait | indian | mad)') AS ts
WHERE fts @@ ts AND title > ''
ORDER BY r DESC;
title         | description                                        | r
--------------+----------------------------------------------------+------
INDIAN LOVE   | A Insightful Saga of a Mad Scientist And a Mad Sci | 0.999
LAWRENCE LOVE | A Fanciful Yarn of a Database Administrator And a  | 0.252
(2 rows)

Let’s suppose we wish to retrieve a film only if the search terms appear in the title. For this situation we would assign a weight of 1 to the A slot (our title lexemes) and 0 to all others. Example 5-54 repeats Example 5-53, passing in an array of weights.

Example 5-54. Ranking search results using custom weights
SELECT
    left(title,40) As title,
    ts_rank('{0,0,0,1}'::numeric[],fts,ts)::numeric(10,3) AS r,
    ts_rank_cd('{0,0,0,1}'::numeric[],fts,ts)::numeric(10,3) As rcd
FROM film, to_tsquery('english', 'love  & (wait | indian | mad )') AS ts
WHERE fts @@ ts AND title > ''
ORDER BY r DESC;
title         | r     | rcd
--------------+-------+------
INDIAN LOVE   | 0.991 | 1.000
LAWRENCE LOVE | 0.000 | 0.000
(2 rows)

Notice how in Example 5-54 the second entry has a ranking of zero because the title does not contain all the words to satisfy the tsquery.

Note

If performance is a concern, you should explicitly declare the FTS configuration in queries instead of allowing the default behavior. As noted in Some FTS Tricks by Oleg Bartunov, you can achieve twice the speed by using to_tsquery('english','social & (science | scientist)') in lieu of to_tsquery('social & (science | scientist)').

Full Text Stripping

By default, vectorization adds markers (location of the lexemes within the vector) and optionally weights (A, B, C, D). If your searches care only whether a particular term can be found, regardless of where it is in the text, how frequently it occurs, or its prominence, you can declutter your vectors using the strip function. This saves disk space and gains some speed. Example 5-55 compares what an unstripped versus stripped vector looks like.

Example 5-55. Unstripped versus stripped vector
SELECT fts
FROM film
WHERE film_id = 1;
'academi':1A 'battl':15B 'canadian':20B 'dinosaur':2A 'drama':5B 'epic':4B
'feminist':8B 'mad':11B 'must':14B 'rocki':21B 'scientist':12B 'teacher':17B
SELECT strip(fts)
FROM film
WHERE film_id = 1;
'academi' 'battl' 'canadian' 'dinosaur' 'drama' 'epic' 'feminist' 'mad'
'must' 'rocki' 'scientist' 'teacher'

Keep in mind that although a stripped vector is faster to search and takes up less disk space, many operators and functions cannot be used in conjunction with them. For instance, because a stripped vector has no markers, distance operators cannot be used.

Full Text Support for JSON and JSONB

New in version 10 are variants of ts_headline and to_tsvector that take json and jsonb as input. These functions work just like the text ones, except they consider only the values of the json/jsonb data and not the keys or json markup. Example 5-56 applies to_tsvector to the json person column of the table we created in Example 5-28.

Example 5-56. Converting json/jsonb to tsvector
SELECT to_tsvector(person)
   FROM persons WHERE id=1;
     to_tsvector
----------------------------------------------------------------------------------
 '-5083':19 '-6719':13 '-722':12 '-852':18 '619':11,17 'alex':3 'azaleah':25
 'brandon':21 'cell':15 'm':23 'ofelia':7 'rafael':5 'sonia':1 'work':9
(1 row)

To apply this function to the jsonb table persons_b, swap out the persons table for persons_b. Similar to the to_tsvector for text, these functions also have a variant that takes the FTS configuration to use as their first argument. To make best use of these functions, create a tsvector column in your table and populate the field using either a trigger or update as needed.
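
Assuming the person column from Example 5-28, a minimal sketch of that setup looks like this:

ALTER TABLE persons_b ADD COLUMN fts tsvector;
UPDATE persons_b SET fts = to_tsvector('english', person);
CREATE INDEX ix_persons_b_fts ON persons_b USING gin (fts);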

Also available now for json and jsonb is the ts_headline function, which wraps all matching text in the json document in HTML tags. Example 5-57 flags all references to Rafael in the document.

Example 5-57. Tag matching words
SELECT ts_headline(person->'spouse'->'parents', 'rafael'::tsquery)
FROM persons_b WHERE id=1;
{"father": "<b>Rafael</b>", "mother": "Ofelia"}
(1 row)

Note the bold HTML tags around the matching value.

Custom and Composite Data Types

This section demonstrates how to define and use a custom type. The composite (aka record, row) object type is often used to build an object that is then cast to a custom type, or as a return type for functions needing to return multiple columns.

All Tables Are Custom Data Types

PostgreSQL automatically creates custom types for all tables. For all intents and purposes, you can use custom types just as you would any other built-in type. So we could conceivably create a table that has a column type that is another table’s custom type, and we can go even further and make an array of that type. We demonstrate this “turducken” in Example 5-58.

Example 5-58. Turducken
CREATE TABLE chickens (id integer PRIMARY KEY);
CREATE TABLE ducks (id integer PRIMARY KEY, chickens chickens[]);
CREATE TABLE turkeys (id integer PRIMARY KEY, ducks ducks[]);

INSERT INTO ducks VALUES (1, ARRAY[ROW(1)::chickens, ROW(1)::chickens]);
INSERT INTO turkeys VALUES (1, array(SELECT d FROM ducks d));

We create an instance of a chicken without adding it to the chickens table itself; hence we’re able to repeat the id with impunity. We take our array of two chickens, stuff them into one duck, and add it to the ducks table. We take the duck we added and stuff it into the turkeys table.

Finally, let’s see what we have in our turkey:

SELECT * FROM turkeys;
 id | ducks
----+------------------------
  1 | {"(1,\"{(1),(1)}\")"}

We can also replace subelements of our turducken. This next example replaces the second chicken inside the first duck of our turkey with a different chicken:

UPDATE turkeys SET ducks[1].chickens[2] = ROW(3)::chickens
WHERE id = 1 RETURNING *;
 id | ducks
----+------------------------
  1 | {"(1,\"{(1),(3)}\")"}

We used the RETURNING clause as discussed in “Returning Affected Records to the User” to output the changed record.

Any row or column, regardless of how complex, can be converted to a json or jsonb value like so:

SELECT id, to_jsonb(ducks) AS ducks_jsonb
FROM turkeys;
id | ducks_jsonb
---+------------------------------------------------
 1 | [{"id": 1, "chickens": [{"id": 1}, {"id": 3}]}]
(1 row)

PostgreSQL internally keeps track of object dependencies. The ducks.chickens column is dependent on the chickens table. The turkeys.ducks column is dependent on the ducks table. You won’t be able to drop the chickens table without specifying CASCADE or first dropping the ducks.chickens column. If you do a CASCADE, the ducks.chickens column will be gone, and without warning, your turkeys will have no chickens in their ducks.
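
You can verify the dependency tracking yourself (the exact error wording may vary by version):

DROP TABLE chickens;
-- ERROR:  cannot drop table chickens because other objects depend on it
DROP TABLE chickens CASCADE;
-- succeeds, and also drops the ducks.chickens column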

Building Custom Data Types

Although you can easily create composite types just by creating a table, at some point, you’ll probably want to build your own from scratch. For example, let’s build a complex number data type with the following statement:

CREATE TYPE complex_number AS (r double precision, i double precision);

We can then use this complex number as a column type:

CREATE TABLE circuits (circuit_id serial PRIMARY KEY, ac_volt complex_number);

We can then query our table with statements such as:

SELECT circuit_id, (ac_volt).* FROM circuits;

or an equivalent:

SELECT circuit_id, (ac_volt).r, (ac_volt).i FROM circuits;
Warning

Puzzled by the parentheses surrounding ac_volt? If you leave them out, PostgreSQL will raise the error missing FROM-clause entry for table “ac_volt” because it assumes ac_volt without parentheses refers to a table.

Composites and NULLs

NULL is a confusing concept in the ANSI SQL Standard, primarily because NULL != NULL. When testing for NULLs, you need to use IS NULL, IS NOT NULL, or NOT (somevalue IS NULL). With noncomposite types, something IS NULL is generally the antithesis of something IS NOT NULL. This is not the case with composites, however.

PostgreSQL abides by the ANSI SQL standard specs when dealing with NULLs. The specs require that in order for a composite to be IS NULL, all elements of the composite must be NULL. Here is where confusion can enter. In order for a composite to be considered IS NOT NULL, every element in the composite must return true for IS NOT NULL.
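
The complex_number type we created earlier illustrates the asymmetry; a half-empty composite is neither NULL nor NOT NULL:

SELECT ROW(NULL,NULL)::complex_number IS NULL     AS all_null,
       ROW(1,NULL)::complex_number IS NULL        AS some_null,
       ROW(1,NULL)::complex_number IS NOT NULL    AS some_not_null;
 all_null | some_null | some_not_null
----------+-----------+---------------
 t        | f         | f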

Building Operators and Functions for Custom Types

After you build a custom type such as a complex number, naturally you’ll want to create functions and operators for it. We’ll demonstrate building a + operator for the complex_number we created. For more details about building functions, see Chapter 8. As stated earlier, an operator is a symbol alias for a function that takes one or two arguments. You can find more details about what symbols and sets of symbols are allowed in CREATE OPERATOR.

In addition to being an alias, an operator contains optimization information that can be used by the query optimizer to decide how indexes should be used, how best to navigate the data, and which operator expressions are equivalent. More details about these optimizations and how each can help the optimizer are in Operator Optimization.

The first step to creating an operator is to create a function, as shown in Example 5-59.

Example 5-59. Add function for complex number
CREATE OR REPLACE FUNCTION add(complex_number, complex_number) 
RETURNS complex_number AS
$$
    SELECT
        ((COALESCE(($1).r,0) + COALESCE(($2).r,0)),
        (COALESCE(($1).i,0) + COALESCE(($2).i,0)))::complex_number;
$$
language sql;

The next step is to create a symbolic operator to wrap the function, as in Example 5-60.

Example 5-60. + operator for complex number
CREATE OPERATOR + (
    PROCEDURE = add,
    LEFTARG = complex_number,
    RIGHTARG = complex_number,
    COMMUTATOR = +
);

We can then test our new + operator:

SELECT (1,2)::complex_number + (3,-10)::complex_number;

which outputs (4,-8).

Although we didn’t demonstrate it here, you can overload functions and operators to take different types as inputs. For example, you can create an add function and companion + operator that takes a complex_number and an integer.
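
As a sketch, such an overload might look like the following; adding the integer to the real part only is our own design choice:

CREATE OR REPLACE FUNCTION add(complex_number, integer)
RETURNS complex_number AS
$$
    SELECT (COALESCE(($1).r,0) + $2, COALESCE(($1).i,0))::complex_number;
$$
language sql;

CREATE OPERATOR + (
    PROCEDURE = add,
    LEFTARG = complex_number,
    RIGHTARG = integer
);

SELECT (1,2)::complex_number + 5;

which outputs (6,2).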

The ability to build custom types and operators pushes PostgreSQL to the boundary of a full-fledged development environment, bringing us ever closer to our utopia where everything is table-driven.
