Chapter 10. Replication and External Data

PostgreSQL has a number of options for sharing data with external servers or data sources. The first option is the built-in replication options of PostgreSQL, which allow you to create a copy of your server ready to run on another PostgreSQL server. The second option is to use third-party add-ons, many of which are freely available and time-tested. The third option is to use a foreign data wrapper (FDW). FDWs give you the flexibility to query from a wide array of external data sources. Since version 9.3, some FDWs also permit updating: these include postgres_fdw, hadoop_fdw, and ogr_fdw (see “Querying Other Tabular Formats with ogr_fdw”).

Replication Overview

The reasons for replicating your databases distill down to two: availability and scalability. Availability is assured by providing a redundant server so that, if your main server goes down, you have another that can immediately assume its role. For small databases, you could just make sure you have another physical server ready and restore the database onto it. But for large databases (in the terabytes), the restore itself could take hours, if not days. To avoid downtime, you’ll need to replicate.

The other motivation for replication is scalability. Suppose you set up a database to breed fancy elephant shrews for profit. After a few years of breeding, you now have thousands of elephant shrews. People all over the world come to your site to gawk and purchase. You’re overwhelmed by the traffic, but replication comes to your aid. You arrange a read-only slave server to replicate with your main server. Then you direct the countless gawkers to the slave, and let only serious buyers onto the master server to finalize their purchases.

Replication Jargon

Before we get too carried away, we should introduce some common lingo in PostgreSQL replication:

Master

The master server is the database server sourcing the data being replicated and where all updates take place. You’re allowed only one master when using the built-in server replication features of PostgreSQL. Plans are in place to support multimaster replication scenarios. Watch for it in future releases. You may also hear the term publisher used to mean the provider of the data. Publisher/subscriber terminology gains more traction in PostgreSQL 10 for built-in logical replication.

Slave

A slave server consumes the replicated data and provides a replica of the master. More aesthetically pleasing terms such as subscriber and agent have been bandied about, but slave is still the most apropos. PostgreSQL built-in replication supports only read-only slaves at this time.

Write-ahead log (WAL)

WAL is the log that keeps track of all transactions, often referred to as the transaction log in other database products. To stage replication, PostgreSQL simply makes the logs available to the slaves. Once slaves have pulled the logs, they just need to execute the transactions therein.
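For example, you can ask the master for its current position in the WAL; the function was renamed in version 10:

SELECT pg_current_xlog_location(); -- versions prior to 10
SELECT pg_current_wal_lsn();       -- version 10 and above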

Synchronous replication

A transaction on the master is not considered complete until at least one synchronous slave listed in synchronous_standby_names updates and reports back. Prior to version 9.6, the transaction was complete as soon as any one synchronous slave responded. In version 9.6 and higher, the number of standbys that must respond is configurable using the synchronous_standby_names setting in postgresql.conf. Version 10 introduced the keywords FIRST and ANY, which can be added to synchronous_standby_names to dictate which nodes must report back. FIRST is the default if not specified and matches the behavior of 9.6.
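For example, with three standbys whose application_names are replica1 through replica3 (hypothetical names), the following version 10 setting requires the first two listed standbys to confirm each commit:

synchronous_standby_names = 'FIRST 2 (replica1, replica2, replica3)'
# or: accept acknowledgment from any two of the three
# synchronous_standby_names = 'ANY 2 (replica1, replica2, replica3)'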

Asynchronous replication

A transaction on the master will commit even if no slave updates. This is expedient for distant servers where you don’t want transactions to wait because of network latency, but the downside is that your dataset on the slave might lag behind. Should the lag be severe, the slave might need to be reinitialized if the transaction it needs to continue has already been removed from the WAL logs.

To minimize the risk of WALs being removed before all slaves have used them, version 9.4 introduced replication slots. A replication slot is a contract between a slave and its master whereby the master will not remove any WAL logs still needed by any replication slot. The hazard is that if a slave holding a replication slot fails or loses communication for a long time, the master will keep the WALs indefinitely and can run out of disk space and shut down.
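As a minimal sketch, you can create a physical replication slot, inspect it, and drop it when it’s no longer needed (the slot name is our own):

SELECT * FROM pg_create_physical_replication_slot('replica1_slot');
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
SELECT pg_drop_replication_slot('replica1_slot');

A slave claims the slot by setting primary_slot_name = 'replica1_slot' in its recovery.conf.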

Streaming replication

Streaming replication does not require direct file access between the master and slaves. Instead, it relies on the PostgreSQL connection protocol to transmit the WALs.

Cascading replication

Slaves can receive logs from nearby slaves instead of directly from the master. This allows a slave to behave like a master for replication purposes. The slave remains read-only. When a slave acts both as a receiver and a sender, it is called a cascading standby.

Logical replication

This is a new replication option in version 10 that allows the replication of individual tables rather than requiring the whole server cluster to be replicated. It relies on a feature called logical decoding, which extracts changes to a database table from the WAL logs in an easy-to-understand format without requiring detailed knowledge of the database’s internal state. Logical decoding has existed since 9.4 and has been used by some extensions for auditing and for providing replication. This new feature comes with the new DDL commands CREATE PUBLICATION and CREATE SUBSCRIPTION for designating which tables to replicate and which servers and corresponding databases to send the data to.

To use this feature, you must set wal_level to logical.

Refer to Logical Replication in PostgreSQL 10 for an example of its use.

Remastering

Remastering promotes a slave to be the master. Version 9.3 introduced streaming-only remastering, which eliminates the need for remastering to consult a WAL archive; it can be done via streaming, and slaves no longer need to be recloned. As of version 9.4, though, a restart is still required. This may change in future releases.

PostgreSQL binary replication replicates only changes that are written to the WAL. Because any DDL command is WAL-logged, the creation of tables and views and the installation of extensions are replicated as well. But because inserts and updates to unlogged tables are not written to the WAL, they cannot be replicated. When installing extensions, you should make sure all slaves have the binaries for the extension and the version of the extension you are installing; otherwise, replication will fail when the CREATE EXTENSION command is executed on the master.

Evolution of PostgreSQL Replication

PostgreSQL’s stock replication relies on WAL shipping. Streaming replication slaves should be running the same OS and bitness (32-bit/64-bit) as the master. It is also recommended that all servers run the same minor version as the master, though running the same patch level (microversion) is not required. Though not recommended, the slave and master can run different microversions; in that case, it’s preferable for the slave to run a newer microversion than the master.

Support for built-in replication improved over the following PostgreSQL releases:

  • Version 9.4 added replication slots. A replication slot is a contract between a master and a slave that requires the master to hold on to WALs until a slave is done processing them.

  • Version 9.5 added several functions for monitoring the progress of replication: refer to Replication Progress Tracking in the documentation.

  • Version 9.6 introduced multiple standby servers in synchronous replication for increased reliability.

  • Version 10 introduced built-in logical replication, which allows the replication of individual tables. The other benefit of logical replication is that a slave can have databases and tables of its own that are not part of replication and that can be updated on the slave. Version 10 also introduced temporary replication slots, which allow a process to create a replication slot on a one-time basis and have it disappear after the session is over. This is particularly useful for initializing a new copy of the server via pg_basebackup.

    Although logical replication is built into PostgreSQL for the first time in version 10, you can use logical replication in PostgreSQL 9.4 and higher versions of PostgreSQL 9 through the open source PostgreSQL extension pglogical. If you need to replicate between version 10 and versions 9.4−9.6, you’ll need to have pglogical installed on both version 10 and the lower-versioned server. For logical replication between version 10 and future versions of PostgreSQL, you can use the built-in logical replication feature.

Third-Party Replication Options

As alternatives to PostgreSQL’s built-in replication, third-party options abound. Slony and Bucardo are two popular open source ones. Although PostgreSQL is improving replication with each new release, Slony, Bucardo, and other third-party replication options still offer more flexibility. Slony and Bucardo allow you to replicate individual databases or even tables instead of the entire server. They also don’t require that all masters and slaves run the same PostgreSQL version and OS, and both support multimaster scenarios. However, both rely on additional triggers, and possibly additional columns on tables, to initiate the replication, and they often don’t replicate DDL commands for actions such as creating new tables or installing extensions. Thus, they require more manual intervention, such as the addition of triggers, table columns, or views.

We urge you to consult a comparison matrix of popular third-party options before deciding what to use.

Setting Up Full Server Replication

Let’s go over the steps to replicate the whole server cluster. We’ll take advantage of streaming replication. Recall that streaming replication only requires connections at the PostgreSQL database level between the master and slaves.

Configuring the Master

The steps for setting up the master are:

  1. Create a replication account:

    CREATE ROLE pgrepuser REPLICATION LOGIN PASSWORD 'woohoo';
  2. Alter the following configuration settings in postgresql.auto.conf. You can do this with ALTER SYSTEM SET variable = value (the equivalent commands are sketched after the settings list), which writes the change to postgresql.auto.conf without your needing to touch the physical config file. Reloadable settings then take effect with SELECT pg_reload_conf();, but most of the settings here require the restart performed in step 5:

    listen_addresses = *
    wal_level = hot_standby
    archive_mode = on
    max_wal_senders = 5
    wal_keep_segments = 10
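    For example, the equivalent ALTER SYSTEM commands are as follows (a sketch; as noted, most of these settings take effect only after the restart in step 5):

    ALTER SYSTEM SET listen_addresses = '*';
    ALTER SYSTEM SET wal_level = 'hot_standby';
    ALTER SYSTEM SET archive_mode = 'on';
    ALTER SYSTEM SET max_wal_senders = 5;
    ALTER SYSTEM SET wal_keep_segments = 10;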

    If you want to use logical replication to do partial replication of only some tables, you’ll need to set wal_level = logical instead. The logical level logs more than hot_standby, so it will also work for full server replication.

    These settings are described in Server Configuration: Replication. You may want to set wal_keep_segments higher if your servers are far apart and your production server has a lot of transactions. If you are running version 9.6 or above, you should use replica instead of hot_standby for the wal_level. hot_standby is still accepted in 9.6 for backward compatibility, but will be read as replica.

  3. Add the archive_command configuration directive to postgresql.auto.conf or use ALTER SYSTEM to indicate where the WALs will be saved. With streaming, you’re free to choose any directory. More details on this setting can be found in the PostgreSQL PGStandby documentation.

    On Linux/Unix, your archive_command line should look something like:

    archive_command = 'cp %p ../archive/%f'

    You can also use rsync instead of cp if you want to store the WALs on a different server (substitute your own user and remote host):

    archive_command = 'rsync -av %p user@remote_host:archive/%f'

    On Windows:

    archive_command = 'copy %p ..\archive\%f'
  4. Add a rule to pg_hba.conf allowing the slaves to replicate. As an example, the following rule will allow a PostgreSQL account named pgrepuser on a server on your private network with an IP address in the range 192.168.0.1 to 192.168.0.254 to replicate using an md5 password:

    host replication pgrepuser 192.168.0.0/24 md5
  5. Restart the PostgreSQL service for the settings to take effect.

  6. Use the pg_basebackup utility, found in the bin folder of your PostgreSQL installation, to create a cluster backup. This will create a copy of the data cluster files in the specified directory.

    When using pg_basebackup, use the --xlog-method=stream switch to also copy over the WAL logs and the -R switch to automatically create a recovery.conf file. The --xlog-method=stream switch will spawn another database connection for copying the WALs.

    Note

    In version 10 and above, the pg_xlog directory is pg_wal, and pg_basebackup’s --xlog-method switch is renamed --wal-method.

    In the following example, we are on the slave server and performing a streaming basebackup from our master server (192.168.0.1):

    pg_basebackup -D /target_dir -h 192.168.0.1 \
        --port=5432 --checkpoint=fast \
        --xlog-method=stream -R

If you are using pg_basebackup primarily for backup purposes, you can use the tarred/compressed form, which will create a tar.gz file in the target_dir folder for each tablespace. -X is shorthand for --xlog-method. The tarred/compressed format does not support streaming the logs, so you have to resort to fetching them with that format:

pg_basebackup -Z9 -D /target_dir/ -h 192.168.0.1 -Ft -Xfetch

For backup, you will also want to augment your base backup with continuous transaction log shipping, using pg_receivexlog for versions prior to 10; in version 10 and above, pg_receivexlog was renamed pg_receivewal. You’ll want to keep this running as a cron job or service to continually back up the logs.
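A minimal sketch, assuming a hypothetical archive directory /archive/wals and the pgrepuser role created earlier:

# version 10 and above
pg_receivewal -h 192.168.0.1 -U pgrepuser -D /archive/wals
# versions prior to 10
pg_receivexlog -h 192.168.0.1 -U pgrepuser -D /archive/wals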

Configuring the Slaves for Full Server Cluster Replication

This part is not needed for logical replication. To minimize headaches, slaves should have the same configuration as the master, especially if you’ll be using them for failover. They must also have the binaries for the same set of PostgreSQL extensions installed; otherwise, when a CREATE EXTENSION command is played back, it will fail and halt the replay. For a server to be a slave, it must be able to play back the WAL transactions of the master. The steps for creating a slave are as follows:

  1. Create a new instance of PostgreSQL with the same version (preferably even the same microversion) as your master server. Keeping servers identical down to the microversion is not a requirement, and you’re welcome to experiment to see how far you can deviate.

  2. Shut down PostgreSQL on the new slave.

  3. Overwrite the data folder files with those you generated with pg_basebackup.

  4. Add the following configuration settings to the postgresql.auto.conf file:

    hot_standby = on
    max_connections = 20 # set equal to or higher than the master's
  5. You don’t need to run the slaves on the same port as the master, so you can optionally change the port either via postgresql.auto.conf, postgresql.conf, or via some other OS-specific startup script that sets the PGPORT environment variable before startup.

  6. Create a new file in the data folder called recovery.conf, with the following contents, substituting the actual hostname, IP address, and port of your master on the second line. This file is created automatically if you used pg_basebackup with the -R switch; you will have to add the trigger_file line yourself, though.

    The application_name setting is optional but useful if you want to track the replica in PostgreSQL system views such as pg_stat_replication:

    standby_mode = 'on'
    primary_conninfo = 'host=192.168.0.1 port=5432 user=pgrepuser password=woohoo application_name=replica1'
    trigger_file = 'failover.now'
  7. If you find that the slave can’t play back WALs fast enough, you can specify a location for caching. In that case, add to the recovery.conf file a line such as the following, which varies depending on the OS.

    On Linux/Unix:
    restore_command = 'cp %p ../archive/%f'
    On Windows:
    restore_command = 'copy %p ..\archive\%f'

    In this example, the archive folder is where we’re caching.

Initiating the Streaming Replication Process

After you have made the base backup with pg_basebackup and put it in place, verify that the settings in recovery.conf look right. Then start up the slave server.

You should now be able to connect to both servers. Any changes you make on the master, even structural changes such as installing extensions or creating tables, should trickle down to the slave. You should also be able to query the slave.
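To confirm that replication is flowing, a couple of quick checks help; the application_name is the one set in recovery.conf:

-- on the master: one row per connected standby
SELECT application_name, state, sync_state FROM pg_stat_replication;

-- on the slave: returns true while the server is still a standby
SELECT pg_is_in_recovery();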

When and if the time comes to liberate a chosen slave, create a blank file called failover.now in the data folder of the slave. PostgreSQL will then complete playback of the WAL and rename the recovery.conf file to recovery.done. At that point, your slave will be unshackled from the master and continue life on its own with all the data from the last WAL. Once the slave has tasted freedom, there’s no going back. In order to make it a slave again, you’ll need to go through the whole process from the beginning.
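Alternatively, if you’d rather not use a trigger file, the pg_ctl utility can promote a slave directly; the data folder path here is hypothetical:

pg_ctl promote -D /path/to/slave_data_folder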

Replicating Only Some Tables or Databases with Logical Replication

New in version 10 is the ability to replicate only some of the tables or some of the databases in your master using an approach called logical replication. One big benefit of logical replication is you can use it to replicate between a PostgreSQL 10 database and future versions of PostgreSQL and even replicate when OS platforms or architectures are different. For example, you can use it to replicate between a Linux server and a Windows server.

In logical replication, the server providing the data is called the publisher and the server receiving the data is called the subscriber. On the publishing server, you run CREATE PUBLICATION in the database containing the tables you want to publish, to dictate which tables to replicate; on the subscriber database, you run CREATE SUBSCRIPTION, denoting the server and publication name it should subscribe to. The main caveat with logical replication is that DDL is not replicated, so in order to replicate a table, the table structure must exist on both the publisher database and the subscriber database.

We have two PostgreSQL 10 instances running on the same machine. The publisher is on port 5447 and the subscriber is on port 5448. The process is the same if the clusters are on separate servers. To replicate:

  1. Make sure the following configuration setting is set on the publisher:

    SHOW wal_level;

    If it returns anything other than logical, run:

    ALTER SYSTEM SET wal_level = logical;

    And then restart the postgres service.

    This can be set on the subscription server as well, especially if the subscription server will in some cases act as a publisher for some tables or databases.

  2. On the database where you will be replicating data, create the table structures for the tables you will be replicating. If you have a lot of tables, or want to replicate a whole database as we will be doing, use pg_dump on the publishing database to back up just the structure of the tables. For example, for the postgresql_book database, we would dump out the structure:

    pg_dump -U postgres -p5447 -Fp --section pre-data --section post-data \
        -f pub_struct.sql postgresql_book

    And then use psql on the subscriber server to create our subscription database with structures as follows:

    CREATE DATABASE book_sub;
    \connect book_sub
    \i pub_struct.sql
  3. We then create a publication on the publisher database for the items we want to replicate. For this exercise, we’ll replicate all the tables in the database using CREATE PUBLICATION. Note that a FOR ALL TABLES publication also covers tables created in the future, though we’d still have to create their structure on the subscriber database:

    CREATE PUBLICATION full_db_pub
        FOR ALL TABLES;
  4. In order to use the publication, we need to subscribe to it. We do this by executing this command when connected to the subscriber database book_sub:

    \connect book_sub
    CREATE SUBSCRIPTION book_sub
        CONNECTION 'host=localhost port=5447 dbname=postgresql_book user=postgres'
        PUBLICATION full_db_pub;

When you inspect the tables on the book_sub database, you should find that all the tables are full of data collected during the initial synchronization. If you add data to the postgresql_book database, you should see the new records appear on the book_sub database.
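To check on the subscription’s progress, version 10 provides a monitoring view on the subscriber:

SELECT subname, received_lsn, latest_end_lsn FROM pg_stat_subscription;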

If you no longer need a subscription or publication, drop the subscription on the subscriber with DROP SUBSCRIPTION and the publication on the publisher with DROP PUBLICATION.
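For example:

DROP SUBSCRIPTION book_sub;    -- run on the subscriber
DROP PUBLICATION full_db_pub;  -- run on the publisher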

Foreign Data Wrappers

FDWs are an extensible, standards-compliant method for your PostgreSQL server to query other data sources, both other PostgreSQL servers and many types of non-PostgreSQL data sources. At the center of the architecture is a foreign table, a table that you can query like other tables in your PostgreSQL database but that resides on another database, perhaps even on another physical server. Once you put in the effort to establish foreign tables, they persist in your database and you’re forever free from having to worry about the intricate protocols of communicating with alien data sources.

You can find the status of popular FDWs and examples of usage at PostgreSQL Wiki FDW, and a catalog of some FDWs for PostgreSQL at PGXN FDW and PGXN Foreign Data Wrapper. You’ll find the source code for many of these and for additional ones on GitHub by searching for PostgreSQL Foreign Data Wrappers. If you need to wrap foreign data sources, start by visiting these links to see whether someone has already done the work of creating wrappers. If not, try creating one yourself. If you succeed, be sure to share it with others.

Most PostgreSQL installs provide two FDWs; you can install file_fdw and postgres_fdw using the CREATE EXTENSION command.

Up through PostgreSQL 9.2, you could use FDWs only to read from foreign sources. Version 9.3 introduced an API feature to update foreign tables as well. postgres_fdw supports updates.

In this section, we’ll demonstrate how to register foreign servers, foreign users, and foreign tables, and finally, how to query foreign tables. Although we use SQL to create and delete objects in our examples, you can perform the exact same commands using pgAdmin.

Querying Flat Files

The file_fdw wrapper is packaged as an extension. To install it, use the following SQL:

CREATE EXTENSION file_fdw;

Although file_fdw can read only from file paths accessible by your local server, you still need to define a server for it for the sake of consistency. Issue the following command to create a “faux” foreign server in your database:

CREATE SERVER my_server FOREIGN DATA WRAPPER file_fdw;

Next, you must register the tables. You can place foreign tables in any schema you want. We usually create a separate schema to house foreign data. For this example, we’ll use our staging schema, as shown in Example 10-1.

Here are a few initial lines of the pipe-delimited file we are linking to, to show the format of the data we are taking in:

Dev|Company
Tom Lane|Crunchy Data
Bruce Momjian|EnterpriseDB
Example 10-1. Make a foreign table from a delimited file
CREATE FOREIGN TABLE staging.devs (developer VARCHAR(150), company VARCHAR(150))
SERVER my_server
OPTIONS (
    format 'csv',
    header 'true',
    filename '/postgresql_book/ch10/devs.psv',
    delimiter '|',
    null ''
);

In our example, even though we’re registering a pipe-delimited file, we still use the csv option. A CSV file, as far as FDW is concerned, represents a file delimited by any specified character.

When the setup is finished, you can finally query your pipe-delimited file directly:

SELECT * FROM staging.devs WHERE developer LIKE 'T%';

Once you no longer need the foreign table, drop it using:

DROP FOREIGN TABLE staging.devs;

Querying Flat Files as Jagged Arrays

Often, flat files have a different number of columns on each line and could include multiple header and footer rows. Our favorite FDW for handling these files is file_textarray_fdw. This wrapper can handle any kind of delimited flat file, even if the number of elements varies from row to row, by treating each row as a text array (text[]).

Unfortunately, file_textarray_fdw is not part of core PostgreSQL, so you’ll need to compile it yourself. First, install PostgreSQL along with the PostgreSQL development headers. Then download the file_textarray_fdw source code from the Adunstan GitHub site. There is a different branch for each version of PostgreSQL, so make sure to pick the right one. Once you’ve compiled the code, install it as an extension, as you would any other FDW.

If you are on Linux/Unix, it’s an easy compile if you have the postgresql-dev package installed. We did the work of compiling for Windows; you can download our binaries from one of the following links: one for Windows 32/64 9.4 FDWs, and another for Windows 32/64 9.5 and 32/64 9.6 FDWs.

The first step to perform after you have installed an FDW is to create an extension in your database:

CREATE EXTENSION file_textarray_fdw;

Then create a foreign server as you would with any FDW:

CREATE SERVER file_taserver FOREIGN DATA WRAPPER file_textarray_fdw;

Next, register the tables. You can place foreign tables in any schema you want. In Example 10-2, we use our staging schema again.

Example 10-2. Make a file text array foreign table from a delimited file
CREATE FOREIGN TABLE staging.factfinder_array (x text[])
SERVER file_taserver
OPTIONS (
    format 'csv',
    filename '/postgresql_book/ch10/DEC_10_SF1_QTH1_with_ann.csv',
    header 'false',
    delimiter ',',
    quote '"',
    encoding 'latin1',
    null ''
);

Our example CSV begins with eight header rows and has more columns than we care to count. When the setup is finished, you can finally query the delimited file directly. The following query returns the column headings from the header row whose first element is GEO.id:

SELECT unnest(x) FROM staging.factfinder_array WHERE x[1] = 'GEO.id';

This next query will give us the first two columns of our data:

SELECT x[1] As geo_id, x[2] As tract_id 
FROM staging.factfinder_array WHERE x[1] ~ '[0-9]+';

Querying Other PostgreSQL Servers

The PostgreSQL FDW, postgres_fdw, is packaged with most distributions of PostgreSQL since PostgreSQL 9.3. This FDW allows you to read as well as push updates to other PostgreSQL servers, even different versions.

Start by installing the FDW for the PostgreSQL server in a new database:

CREATE EXTENSION postgres_fdw;

Next, create a foreign server:

CREATE SERVER book_server
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'localhost', port '5432', dbname 'postgresql_book');

If you need to change or add connection options to the foreign server after creation, you can use the ALTER SERVER command. For example, if you needed to change the server you are pointing to, you could enter:

ALTER SERVER book_server OPTIONS (SET host 'prod');
Warning

Changes to connection settings such as the host, port, and database do not take effect until a new session is created. This is because the connection is opened on first use and is kept open.

Next, create a user, mapping its public role to a role on the foreign server:

CREATE USER MAPPING FOR public SERVER book_server
OPTIONS (user 'role_on_foreign', password 'your_password');

The role you map to must exist on the foreign server and have login rights. Anyone who can connect to your database will be able to access the foreign server as well.
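If you’d rather not expose the foreign server to everyone, you can map a specific local role instead of public; the local role name here is hypothetical:

CREATE USER MAPPING FOR local_analyst SERVER book_server
OPTIONS (user 'role_on_foreign', password 'your_password');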

Now you are ready to create a foreign table. This table can have a subset of columns of the table it connects to. In Example 10-3, we create a foreign table that maps to the census.facts table.

Example 10-3. Defining a PostgreSQL foreign table
CREATE FOREIGN TABLE ft_facts (
    fact_type_id int NOT NULL,
    tract_id varchar(11),
    yr int,
    val numeric(12,3),
    perc numeric(6,2)
)
SERVER book_server OPTIONS (schema_name 'census', table_name 'facts');

This example includes only the most basic options for the foreign table. By default, all PostgreSQL foreign tables are updatable, unless the remote account you use lacks update privileges. The updatable setting is a Boolean that can be changed on the foreign table or on the foreign server definition. For example, to make your table read-only, execute:

ALTER FOREIGN TABLE ft_facts OPTIONS (ADD updatable 'false');

You can set the table back to updatable by running:

ALTER FOREIGN TABLE ft_facts OPTIONS (SET updatable 'true');

The updatable property at the table level overrides the foreign server setting.

In addition to changing OPTIONS, you can add and drop columns with the ALTER FOREIGN TABLE ... ADD COLUMN and ALTER FOREIGN TABLE ... DROP COLUMN statements.
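For example, the following illustrative commands add and then drop a column on the foreign table definition; an added column must correspond to a column on the remote table for queries against it to succeed:

ALTER FOREIGN TABLE ft_facts ADD COLUMN notes text;
ALTER FOREIGN TABLE ft_facts DROP COLUMN notes;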

PostgreSQL 9.5 introduced the IMPORT FOREIGN SCHEMA command, which saves a great deal of time by automatically creating the foreign tables for you. Not all FDWs support IMPORT FOREIGN SCHEMA. Each FDW can also support a custom set of server options when importing. postgres_fdw supports the following custom options:

import_collate

This copies the collation settings from the foreign server for the foreign tables. The default for this setting is true.

import_default

This controls whether default values for columns should be included. The default for the option is false, so columns on the local server have no defaults. But default values are useful during inserts: if you neglect to specify the value of a column, PostgreSQL automatically inserts the default. Be careful, though—the behavior of default could be unexpected if you’re relying on a sequence for auto-numbering. The next assigned value from the sequence could be different between the foreign server and the local server.

import_not_null

This controls whether NOT NULL constraints are imported. The default is true.

In Example 10-4, we import all tables from the public schema of our book database.

Example 10-4. Use IMPORT FOREIGN SCHEMA to link all tables in a schema
CREATE SCHEMA remote_census;
IMPORT FOREIGN SCHEMA public
FROM SERVER book_server 
INTO remote_census
OPTIONS (import_default 'true');

IMPORT FOREIGN SCHEMA, as shown in Example 10-4, creates foreign tables with the same names as those in the foreign schema and places them in the designated schema, remote_census.

To bring in only a subset of tables, use LIMIT TO or EXCEPT modifiers. For example, to bring in just the facts and lu_fact_types tables, we could have written:

IMPORT FOREIGN SCHEMA census
    LIMIT TO (facts, lu_fact_types)
    FROM SERVER book_server INTO remote_census;

If a table specified in the LIMIT TO does not exist on the remote server, no error will be thrown. You might want to verify after the import that you have all the foreign tables you expected.

A companion clause to LIMIT TO is the EXCEPT clause. Instead of bringing in tables listed, it brings in tables not listed.
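For example, to link every table in the census schema except facts:

IMPORT FOREIGN SCHEMA census
    EXCEPT (facts)
    FROM SERVER book_server INTO remote_census;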

If you take advantage of PostgreSQL extensions, you’ll want to use the performance-enhancing foreign server option introduced in version 9.6, called extensions. To use it, add the option to an existing postgres_fdw server as we do in the following example:

ALTER SERVER book_server OPTIONS (ADD extensions 'btree_gist, pg_trgm');

The extensions option is a comma-separated list of extensions installed on the foreign server. When PostgreSQL runs a query whose WHERE clause involves any of the types or functions defined in those extensions, it will try to push the function calls to the remote server for improved performance. If the extensions option is not specified, all extension functions are run locally, which may require transferring more data.

Querying Other Tabular Formats with ogr_fdw

There are many FDWs for querying other relational databases or flat file formats. Most FDWs target a specific kind of data source. For example, you can find the MongoDB FDW for querying MongoDB data, the Hadoop FDW for querying Hadoop data sources, and the MySQL FDW for querying MySQL data sources.

There are two FDWs we are aware of that bundle many formats. Multicorn FDW is really an FDW API that allows you to write your own FDW in Python. There are some ready-made drivers available, but the Multicorn FDW currently has no offering on Windows and is often tricky to get working on Linux.

ogr_fdw is another FDW that supports many formats, and the one we’ll demonstrate in this section. ogr_fdw supports many tabular formats, such as spreadsheets, Dbase files, and CSVs, as well as other relational databases. It is also a spatial database driver that transforms spatial columns from other databases like SQL Server or Oracle into the PostGIS PostgreSQL spatial geometry type.

Several packages that distribute PostGIS also offer the ogr_fdw extension. For instance, the PostGIS Bundle for Windows found on the stack builder includes the ogr_fdw extension, ogr_fdw for CentOS/RHEL is available via yum.postgresql.org, and BigSQL Linux/Mac/Windows PostgreSQL distribution also offers ogr_fdw. If you need or want to compile it yourself, the source for ogr_fdw is on GitHub.

Underneath the hood, ogr_fdw relies on the Geospatial Data Abstraction Library (GDAL) to do the heavy lifting. Therefore, you need to have GDAL compiled and installed before being able to compile or use ogr_fdw. GDAL has undergone quite a few evolutions, and its capabilities vary according to the dependencies it was compiled with. So be warned that your GDAL may not be our GDAL. GDAL is generally installed as part of PostGIS, the spatial extension for PostgreSQL. So to make GDAL use easier, we recommend always installing the latest version of PostGIS.

Many GDAL instances come with support for Excel, LibreOffice Calc, ODBC, and various Spatial web services. You will find support for Microsoft Access on Windows, but rarely on Linux/Mac distributions.

After you have installed the ogr_fdw binaries, to enable the ogr_fdw in a particular database, connect to the database and run:

CREATE EXTENSION ogr_fdw;

Foreign servers take on different meanings depending on the type of data source. For example, a folder of CSV files would be considered a server, with each file being a separate table. A Microsoft Excel or LibreOffice Calc workbook would be considered a server, with each sheet in the workbook being a separate table. An SQLite database would be considered a server and each table a foreign table.

The following example links a LibreOffice workbook as a server and corresponding spreadsheets as foreign tables:

CREATE SERVER ogr_fdw_wb
FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (
    datasource '/fdw_data/Budget2015.ods',
    format 'ODS'
);

CREATE SCHEMA wb_data;
IMPORT FOREIGN SCHEMA ogr_all
FROM SERVER ogr_fdw_wb INTO wb_data;

The ogr_all schema is a catch-all that imports all tables in the foreign server regardless of schema. Some data sources have schemas and some don’t. To accommodate all inputs, ogr_fdw accepts, in place of ogr_all, the initial characters of a table name as the schema name. So, for example, if you wanted to import just the subset of worksheets whose names begin with “Finance,” you would replace ogr_all with “Finance”:

CREATE SCHEMA wb_data;
IMPORT FOREIGN SCHEMA "Finance"
FROM SERVER ogr_fdw_wb INTO wb_data;

The schema is case-sensitive, so if the name of a worksheet contains uppercase characters or nonstandard characters, it needs to be quoted.

This next example creates a server pointing to a folder of CSV files, and a schema ff to house the foreign tables for the CSV server. The FDW will then create foreign tables in schema ff, linked to the CSV files whose names begin with Housing:

CREATE SERVER ogr_fdw_ff
FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (datasource '/fdw_data/factfinder', format 'CSV');
CREATE SCHEMA ff;
IMPORT FOREIGN SCHEMA "Housing"
FROM SERVER ogr_fdw_ff INTO ff;

In the preceding example, CSV files named Housing_2015.csv and Housing_2016.csv will be linked in as foreign tables in schema ff with the names housing_2015 and housing_2016.

ogr_fdw by default launders table and column names: all-uppercase table and column names are converted to lowercase. If you don’t want this behavior, you can pass settings to IMPORT FOREIGN SCHEMA to keep table and column names as they are named in the foreign data source. For example:

IMPORT FOREIGN SCHEMA "Housing"
     FROM SERVER ogr_fdw_ff INTO ff
      OPTIONS(launder_table_names 'false', launder_column_names 'false');

This creates the tables with the names Housing_2015 and Housing_2016, and the column names appear in the same case as they do in the headers of the files.

Querying Nonconventional Data Sources

The database world does not appear to be getting more homogeneous. Exotic databases are spawned faster than virile elephants. Some are fads and quickly drown in their own hype. Some aspire to dethrone relational databases altogether. Some could hardly be considered databases. The introduction of FDWs is in part a response to this growing diversity. FDWs assimilate external sources without compromising the PostgreSQL core.

In this next example, we’ll demonstrate how to use the www_fdw FDW to query web services. We borrowed the example from www_fdw Examples.

The www_fdw FDW is not generally packaged with PostgreSQL. If you are on Linux/Unix, it’s an easy compile if you have the postgresql-dev package installed and can download the latest source. We did the work of compiling for some Windows platforms; you can download our binaries from Windows-32 9.1 FDWs and Windows-64 9.3 FDWs.

Now create an extension to hold the FDW:

CREATE EXTENSION www_fdw;

Then create your Google foreign data server:

CREATE SERVER www_fdw_server_google_search
    FOREIGN DATA WRAPPER www_fdw
    OPTIONS (uri 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0');

The default format supported by www_fdw is JSON, so we didn’t need to include it in the OPTIONS modifier. The other supported format is XML. For details on additional parameters that you can set, refer to the www_fdw documentation. Each FDW is different and comes with its own API settings.

Next, establish at least one user for your FDW. All users that connect to your server should be able to access the Google search server, so here we create one for the entire public group:

CREATE USER MAPPING FOR public SERVER www_fdw_server_google_search;

Now create your foreign table, as shown in Example 10-5. Each field in the table corresponds to a GET parameter in the URL that Google creates for a search.

Example 10-5. Make a foreign table from Google
CREATE FOREIGN TABLE www_fdw_google_search (
    q text,
    GsearchResultClass text,
    unescapedUrl text,
    url text,
    visibleUrl text,
    cacheUrl text,
    title text,
    content text
) SERVER www_fdw_server_google_search;

The user mapping doesn’t assign any rights. You still need to grant rights before being able to query the foreign table:

GRANT SELECT ON TABLE www_fdw_google_search TO public;

Now comes the fun part. We search with the term New in PostgreSQL 10 and mix in a bit of regular expression goodness to strip off HTML tags:

SELECT regexp_replace(title,E'(?x)(< [^>]*? >)','','g') As title
FROM www_fdw_google_search
WHERE q = 'New in PostgreSQL 10'
LIMIT 2;

Voilà! We have our response:

title
---------------------
PostgreSQL 10 Roadmap
PostgreSQL: Roadmap
(2 rows)