PostgreSQL has a number of options for sharing data with external
servers or data sources. The first option is the built-in replication options of PostgreSQL, which allow you to create a
copy of your server ready to run on another PostgreSQL server. The second
option is to use third-party add-ons, many of which are freely available and
time-tested. The third option is to use a foreign data wrapper (FDW). FDWs
give you the flexibility to query from a wide array of external data
sources. Since version 9.3, some FDWs also permit updating; these include postgres_fdw, hadoop_fdw, and ogr_fdw (see “Querying Other Tabular Formats with ogr_fdw”).
The reasons for replicating your databases distill down to two: availability and scalability. Availability is assured by providing a redundant server so that, if your main server goes down, you have another that can immediately assume its role. For small databases, you could just make sure you have another physical server ready and restore the database onto it. But for large databases (in the terabytes), the restore itself could take hours, if not days. To avoid downtime, you’ll need to replicate.
The other motivation for replication is scalability. Suppose you set up a database to breed fancy elephant shrews for profit. After a few years of breeding, you now have thousands of elephant shrews. People all over the world come to your site to gawk and purchase. You’re overwhelmed by the traffic, but replication comes to your aid. You arrange a read-only slave server to replicate with your main server. Then you direct the countless gawkers to the slave, and let only serious buyers onto the master server to finalize their purchases.
Before we get too carried away, we should introduce some common lingo in PostgreSQL replication:
The master server is the database server sourcing the data being replicated and where all updates take place. You’re allowed only one master when using the built-in server replication features of PostgreSQL. Plans are in place to support multimaster replication scenarios. Watch for it in future releases. You may also hear the term publisher used to mean the provider of the data. Publisher/subscriber terminology gains more traction in PostgreSQL 10 for built-in logical replication.
A slave server consumes the replicated data and provides a replica of the master. More aesthetically pleasing terms such as subscriber and agent have been bandied about, but slave is still the most apropos. PostgreSQL built-in replication supports only read-only slaves at this time.
WAL is the log that keeps track of all transactions, often referred to as the transaction log in other database products. To stage replication, PostgreSQL simply makes the logs available to the slaves. Once slaves have pulled the logs, they just need to execute the transactions therein.
With synchronous replication, a transaction on the master is not considered complete until at least one synchronous slave listed in synchronous_standby_names updates and reports back. Prior to version 9.6, the transaction completed as soon as any one synchronous slave responded. In version 9.6 and higher, the number of standbys that must respond is configurable via the synchronous_standby_names variable in postgresql.conf. Version 10 introduced the keywords FIRST and ANY, which can be added to synchronous_standby_names to dictate which nodes need to report back. FIRST is the default if not specified and matches the behavior of 9.6.
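For reference, here is a minimal sketch of these settings in postgresql.conf; the standby names s1, s2, and s3 are hypothetical:

```
# wait for acknowledgment from the two highest-priority standbys among the three listed
synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'

# version 10 and above: wait for acknowledgment from any two of the listed standbys
# synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
```

Each standby identifies itself by the application_name it sets in its connection string.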
With asynchronous replication, a transaction on the master will commit even if no slave has updated. This is expedient for distant servers, where you don’t want transactions to wait because of network latency, but the downside is that the dataset on the slave might lag behind. Should the lag be severe, the slave might need to be reinitialized if a transaction it needs has already been removed from the WAL logs.
To minimize the risk of WALs being removed before all slaves have used them, version 9.4 introduced replication slots. A replication slot is a contract between a slave and its master whereby the master will not wipe out any WAL logs that are still needed by any replication slot. The hazard is that if a slave holding a replication slot fails or loses communication for a long time, the master will keep the WALs indefinitely and may run out of disk space and shut down.
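As a sketch, physical replication slots can be created and monitored with the built-in functions and views available since version 9.4; the slot name slave1_slot is hypothetical:

```sql
-- on the master: create a physical replication slot for a slave to attach to
SELECT * FROM pg_create_physical_replication_slot('slave1_slot');

-- check which slots exist and whether a consumer is currently attached
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;

-- drop the slot when the slave is retired, so retained WALs can be recycled
SELECT pg_drop_replication_slot('slave1_slot');
```

A slave attaches to the slot by setting primary_slot_name in its recovery.conf.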
Streaming replication does not require direct file access between the master and slaves. Instead, it relies on the PostgreSQL connection protocol to transmit the WALs.
Slaves can receive logs from nearby slaves instead of directly from the master. This allows a slave to behave like a master for replication purposes. The slave remains read-only. When a slave acts both as a receiver and a sender, it is called a cascading standby.
Logical replication is a new replication option in version 10 that allows the replication of individual tables instead of requiring the whole server cluster to be replicated. It relies on a feature called logical decoding, which extracts changes to a database table from the WAL logs in an easy-to-understand format without detailed knowledge of the database’s internal state. Logical decoding has existed since 9.4 and has been used by some extensions for auditing and providing replication. This new feature comes with the new DDL commands CREATE PUBLICATION and CREATE SUBSCRIPTION for designating which tables to replicate and which servers and corresponding databases to send data to.
To use this feature, you must set wal_level to logical.
Refer to Logical Replication in PostgreSQL 10 for an example of its use.
Remastering promotes a slave to be the master. Version 9.3 introduced streaming-only remastering, which eliminates the need for remastering to consult a WAL archive; it can be done via streaming, and slaves no longer need to be recloned. As of version 9.4, though, a restart is still required. This may change in future releases.
PostgreSQL binary replication replicates only changes that are transactional. Because any DDL command is transactional, the creation of tables and views and the installation of extensions can be replicated as well. But because unlogged table inserts and updates are not transactional, they cannot be replicated. When installing extensions, make sure all slaves have the binaries for the extension and the version of the extension you are installing; otherwise, replication will fail when the CREATE EXTENSION command is executed on the master.
PostgreSQL’s stock replication relies on WAL shipping. Streaming replication slaves should be running the same OS and bitness (32-bit/64-bit) as the master, and all servers must run the same major version. Running the same patch level (microversion) as the master is recommended but not required. Though not recommended, the slave and master can run different microversions; in that case, it’s preferable for the slave to run a newer microversion than the master.
Support for built-in replication improved over the following PostgreSQL releases:
Version 9.4 added replication slots. A replication slot is a contract between a master and a slave that requires the master to hold on to WALs until a slave is done processing them.
Version 9.5 added several functions for monitoring the progress of replication: refer to Replication Progress Tracking in the documentation.
Version 9.6 introduced multiple standby servers in synchronous replication for increased reliability.
Version 10 introduced built-in logical replication, which allows the replication of individual tables. The other benefit of logical replication is that a slave can have databases and tables of its own that are not part of replication and that can be updated on the slave. Version 10 also introduced temporary replication slots, which allow a process to create a replication slot on a one-time basis and have it disappear after the session is over. This is particularly useful for initializing a new copy of the server via pg_basebackup.
Although logical replication is built into PostgreSQL for the first time in version 10, you can use logical replication in PostgreSQL 9.4 and higher through the open source PostgreSQL extension pglogical. If you need to replicate between version 10 and versions 9.4-9.6, you’ll need to have pglogical installed on both version 10 and the lower-versioned server. For logical replication between version 10 and future versions of PostgreSQL, you can use the built-in logical replication feature.
As alternatives to PostgreSQL’s built-in replication, common third-party options abound. Slony and Bucardo are two popular open source ones. Although PostgreSQL is improving replication with each new release, Slony, Bucardo, and other third-party replication options still offer more flexibility. Slony and Bucardo allow you to replicate individual databases or even tables instead of the entire server. They also don’t require that all masters and slaves run the same PostgreSQL version and OS. Both also support multimaster scenarios. However, both rely on additional triggers and possible addition of columns to tables to initiate the replication and often don’t replicate DDL commands for rare actions such as creating new tables, installing extensions, and so on. Thus, they require more manual intervention, such as the addition of triggers, additional table fields, or views.
We urge you to consult a comparison matrix of popular third-party options before deciding what to use.
Let’s go over the steps to replicate the whole server cluster. We’ll take advantage of streaming replication. Recall that streaming replication only requires connections at the PostgreSQL database level between the master and slaves.
The steps for setting up the master are:
Create an account with replication rights for the slaves to connect with:
CREATE ROLE pgrepuser REPLICATION LOGIN PASSWORD 'woohoo';
Alter the following configuration settings in postgresql.auto.conf. This can be done using ALTER SYSTEM SET variable = value followed by SELECT pg_reload_conf(); without the need to touch the physical config file:
listen_addresses = '*'
wal_level = hot_standby
archive_mode = on
max_wal_senders = 5
wal_keep_segments = 10
If you want to use logical replication to do partial replication of only some tables, you’ll need to set wal_level = logical. The logical level logs more than hot_standby, so it will also work for full server replication.
These settings are described in Server Configuration: Replication. You may want to set wal_keep_segments higher if your servers are far apart and your production server has a lot of transactions. If you are running version 9.6 or above, you should use replica instead of hot_standby for wal_level. hot_standby is still accepted in 9.6 for backward compatibility, but will be read as replica.
Add the archive_command configuration directive to postgresql.auto.conf, or use ALTER SYSTEM, to indicate where the WALs will be saved. With streaming, you’re free to choose any directory. More details on this setting can be found in the PostgreSQL pg_standby documentation.
On Linux/Unix, your archive_command line should look something like:
archive_command = 'cp %p ../archive/%f'
You can also use rsync instead of cp if you want to store the WALs on a different server:
archive_command = 'rsync -av %p [email protected]:archive/%f'
On Windows:
archive_command = 'copy %p ..\archive\%f'
Add a rule to pg_hba.conf allowing the slaves to replicate. As an example, the following rule will allow a PostgreSQL account named pgrepuser on a server on your private network with an IP address in the range 192.168.0.1 to 192.168.0.254 to replicate using an md5 password:
host replication pgrepuser 192.168.0.0/24 md5
Restart the PostgreSQL service for the settings to take effect.
Use the pg_basebackup utility, found in the bin folder of your PostgreSQL installation, to create a cluster backup. This will create a copy of the data cluster files in the specified directory.
When using pg_basebackup, use the --xlog-method=stream switch to also copy over the WAL logs and the -R switch to automatically create a config file. The --xlog-method=stream switch will spawn another database connection for copying the WALs.
In version 10 and above, the pg_xlog directory is pg_wal, and pg_basebackup’s --xlog-method switch was renamed --wal-method (-X remains the shorthand).
In the following example, we are on the slave server and performing a streaming basebackup from our master server (192.168.0.1):
pg_basebackup -D /target_dir -h 192.168.0.1 --port=5432 --checkpoint=fast --xlog-method=stream -R
If you are using pg_basebackup primarily for backup purposes, you can use the tarred/compressed form, which will create a tar.gz file in the target_dir folder for each tablespace. -X is shorthand for --xlog-method. The tarred/compressed format does not support streaming logs, so you have to resort to fetching the logs with that format:
pg_basebackup -Z9 -D /target_dir/ -h 192.168.0.1 -Ft -Xfetch
For backup, you will want to augment your backup with transaction log shipping using pg_receivexlog for versions prior to 10. In version 10, pg_receivexlog was renamed to pg_receivewal. You’ll want to keep it running as a cron job or service to continually back up the logs.
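A minimal sketch of such a log-shipping job, assuming a master at 192.168.0.1 and a local /archive/wals directory (both hypothetical):

```
pg_receivexlog -h 192.168.0.1 -p 5432 -U pgrepuser -D /archive/wals

# version 10 and above:
# pg_receivewal -h 192.168.0.1 -p 5432 -U pgrepuser -D /archive/wals
```

Run it under a supervisor or as a service so it restarts if the connection drops.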
This part is not needed for logical replication. To minimize headaches, slaves should have the same configuration as the master, especially if you’ll be using them for failover. They must also have the binaries for the same set of PostgreSQL extensions installed; otherwise, when CREATE EXTENSION is played back, it will fail and stop the restore. In order for the server to be a slave, it must be able to play back the WAL transactions of the master. The steps for creating a slave are as follows:
Create a new instance of PostgreSQL with the same version (preferably even microversions) as your master server. For PostgreSQL, keeping servers identical for microversions is not a requirement, and you’re welcome to experiment and see how far you can deviate.
Shut down PostgreSQL on the new slave.
Overwrite the data folder files with those you generated with pg_basebackup.
Add the following configuration settings to the postgresql.auto.conf file:
hot_standby = on
max_connections = 20  # set equal to or higher than on the master
You don’t need to run the slaves on the same port as the master, so you can optionally change the port via postgresql.auto.conf, postgresql.conf, or some other OS-specific startup script that sets the PGPORT environment variable before startup.
Create a new file in the data folder called recovery.conf with the following contents, but substitute the actual hostname, IP address, and port of your master on the second line. This file is automatically created if you used pg_basebackup. You will have to add the trigger_file line though.
The application_name is optional but useful if you want to track the replica in postgresql system views:
standby_mode = 'on'
primary_conninfo = 'host=192.168.0.1 port=5432 user=pgrepuser password=woohoo application_name=replica1'
trigger_file = 'failover.now'
If you find that the slave can’t play back WALs fast enough, you can specify a location for caching. In that case, add to the recovery.conf file a line such as one of the following, depending on the OS. In these examples, the archive folder is where we’re caching.
On Linux/Unix:
restore_command = 'cp %p ../archive/%f'
On Windows:
restore_command = 'copy %p ..\archive\%f'
After you have made the basebackup with pg_basebackup and put it in place, verify that the settings in the recovery.conf look right. Then start up the slave server.
You should now be able to connect to both servers. Any changes you make on the master, even structural changes such as installing extensions or creating tables, should trickle down to the slave. You should also be able to query the slave.
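To confirm that replication is flowing, you can query the built-in statistics views; the replica1 name comes from the application_name set in primary_conninfo:

```sql
-- on the master: one row per connected slave
SELECT application_name, state, sync_state
FROM pg_stat_replication;

-- on the slave: returns true while the server is in recovery (i.e., acting as a slave)
SELECT pg_is_in_recovery();
```

A state of streaming on the master indicates the slave is caught up and receiving WAL as it is generated.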
When and if the time comes to liberate a chosen slave, create a blank file called failover.now in the data folder of the slave. PostgreSQL will then complete playback of the WAL and rename the recovery.conf file to recovery.done. At that point, your slave will be unshackled from the master and continue life on its own with all the data from the last WAL. Once the slave has tasted freedom, there’s no going back. In order to make it a slave again, you’ll need to go through the whole process from the beginning.
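As an aside, a slave can also be promoted without a trigger file by using pg_ctl; the data directory path here is hypothetical:

```
pg_ctl promote -D /target_dir
```

This has the same effect as the trigger file: the slave finishes WAL playback and exits recovery.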
New in version 10 is the ability to replicate only some of the tables or some of the databases in your master using an approach called logical replication. One big benefit of logical replication is you can use it to replicate between a PostgreSQL 10 database and future versions of PostgreSQL and even replicate when OS platforms or architectures are different. For example, you can use it to replicate between a Linux server and a Windows server.
In logical replication, the server providing the data is called the publisher and the server receiving the data is called the subscriber. You use CREATE PUBLICATION on the publishing server, in the database with the tables you want to publish, to dictate which tables to replicate, and CREATE SUBSCRIPTION on the subscriber database, denoting the server and publication name it should subscribe to. The main caveat with logical replication is that DDL is not replicated, so in order to replicate a table, the table structure must exist on both the publisher database and the subscriber database.
We have two PostgreSQL 10 servers running on our server. The publisher is on port 5447 and the subscriber is on port 5448. The process is the same if clusters are on separate servers. To replicate:
Make sure the following configuration setting is set on the publisher:
SHOW wal_level;
If it shows anything other than logical, run:
ALTER SYSTEM SET wal_level = logical;
And then restart the postgres service.
This can be set on the subscription server as well, especially if the subscription server will in some cases act as a publisher for some tables or databases.
On the database where you will be replicating data, create the table structures for the tables you will be replicating. If you have a lot of tables or want to replicate a whole database, as we will be doing, use pg_dump on the publishing database to create a backup of the table structures. For example, for the postgresql_book database, we would dump out the structure:
pg_dump -U postgres -p5447 -Fp --section pre-data --section post-data -f pub_struct.sql postgresql_book
And then use psql on the subscriber server to create our subscription database with structures as follows:
CREATE DATABASE book_sub;
\connect book_sub
\i pub_struct.sql
We then create a publication on the publisher database of the items we want to replicate. For this exercise, we’ll replicate all the tables in the database using CREATE PUBLICATION. Note that this command will also replicate tables created in the future, though you’ll still have to create their structure on the subscription database:
CREATE PUBLICATION full_db_pub FOR ALL TABLES;
In order to use the publication, we need to subscribe to it. We do this by executing this command when connected to the subscriber database book_sub:
\connect book_sub
CREATE SUBSCRIPTION book_sub
CONNECTION 'host=localhost port=5447 dbname=postgresql_book user=postgres'
PUBLICATION full_db_pub;
When you inspect the tables on the book_sub database, you should find that all the tables are full of data collected during the initial synchronization. If you add data to the postgresql_book database, you should see the new records appear on the book_sub database.
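You can check the state of the publication and subscription from the system catalogs; these views exist in version 10 and above:

```sql
-- on the publisher: list publications and their member tables
SELECT * FROM pg_publication;
SELECT * FROM pg_publication_tables;

-- on the subscriber: monitor the subscription workers
SELECT * FROM pg_stat_subscription;
```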
If you no longer need them, you can drop the subscription from the subscriber with DROP SUBSCRIPTION and the publication from the publisher with DROP PUBLICATION.
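For example, continuing with the objects created earlier:

```sql
-- on the subscriber database book_sub
DROP SUBSCRIPTION book_sub;

-- on the publisher database postgresql_book
DROP PUBLICATION full_db_pub;
```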
FDWs are an extensible, standards-compliant method for your PostgreSQL server to query other data sources, both other PostgreSQL servers and many types of non-PostgreSQL data sources. At the center of the architecture is the foreign table, a table that you can query like other tables in your PostgreSQL database but that resides on another database, perhaps even on another physical server. Once you put in the effort to establish foreign tables, they persist in your database and you’re forever free from having to worry about the intricate protocols of communicating with alien data sources.
You can find the status of popular FDWs and examples of usage at the PostgreSQL Wiki FDW page, and a catalog of some FDWs for PostgreSQL at PGXN FDW and PGXN Foreign Data Wrapper. You’ll find the source code for many of these and for additional ones on GitHub by searching for PostgreSQL Foreign Data Wrappers. If you need to wrap foreign data sources, start by visiting these links to see whether someone has already done the work of creating wrappers. If not, try creating one yourself. If you succeed, be sure to share it with others.
Most PostgreSQL installs provide two FDWs: you can install file_fdw and postgres_fdw using the CREATE EXTENSION command.
Up through PostgreSQL 9.2, you could use FDWs only to read from foreign sources. Version 9.3 introduced an API feature to update foreign tables as well. postgres_fdw supports updates.
In this section, we’ll demonstrate how to register foreign servers, foreign users, and foreign tables, and finally, how to query foreign tables. Although we use SQL to create and delete objects in our examples, you can perform the exact same commands using pgAdmin.
The file_fdw wrapper is packaged as an extension. To install it, use the following SQL:
CREATE EXTENSION file_fdw;
Although file_fdw can read only from file paths accessible by your local server, you still need to define a server for it for the sake of consistency. Issue the following command to create a “faux” foreign server in your database:
CREATE SERVER my_server FOREIGN DATA WRAPPER file_fdw;
Next, you must register the tables. You can place foreign tables in any schema you want. We usually create a separate schema to house foreign data. For this example, we’ll use our staging schema, as shown in Example 10-1.
Here are a few initial lines of the pipe-delimited file we are linking to, to show the format of the data we are taking in:
Dev|Company
Tom Lane|Crunchy Data
Bruce Momjian|EnterpriseDB
CREATE FOREIGN TABLE staging.devs (developer VARCHAR(150), company VARCHAR(150))
SERVER my_server
OPTIONS (
  format 'csv',
  header 'true',
  filename '/postgresql_book/ch10/devs.psv',
  delimiter '|',
  null ''
);
In our example, even though we’re registering a pipe-delimited file, we still use the csv option. A CSV file, as far as the FDW is concerned, is a file delimited by any specified character.
When the setup is finished, you can finally query your pipe-delimited file directly:
SELECT * FROM staging.devs WHERE developer LIKE 'T%';
Once you no longer need the foreign table, drop it using:
DROP FOREIGN TABLE staging.devs;
Often, flat files have a different number of columns on each line and could include multiple header and footer rows. Our favorite FDW for handling these files is file_textarray_fdw. This wrapper can handle any kind of delimited flat file, even if the number of elements varies from row to row, by treating each row as a text array (text[]).
Unfortunately, file_textarray_fdw is not part of core PostgreSQL, so you’ll need to compile it yourself. First, install PostgreSQL with the PostgreSQL development headers. Then download the file_textarray_fdw source code from the Adunstan GitHub site. There is a different branch for each version of PostgreSQL, so make sure to pick the right one. Once you’ve compiled the code, install it as an extension, as you would any other FDW.
If you are on Linux/Unix, it’s an easy compile if you have the postgresql-dev package installed. We did the work of compiling for Windows; you can download our binaries from one of the following links: one for Windows 32/64 9.4 FDWs, and another for Windows 32/64 9.5 and 32/64 9.6 FDWs.
The first step to perform after you have installed an FDW is to create an extension in your database:
CREATE EXTENSION file_textarray_fdw;
Then create a foreign server as you would with any FDW:
CREATE SERVER file_taserver FOREIGN DATA WRAPPER file_textarray_fdw;
Next, register the tables. You can place foreign tables in any schema you want. In Example 10-2, we use our staging schema again.
CREATE FOREIGN TABLE staging.factfinder_array (x text[])
SERVER file_taserver
OPTIONS (
  format 'csv',
  filename '/postgresql_book/ch10/DEC_10_SF1_QTH1_with_ann.csv',
  header 'false',
  delimiter ',',
  quote '"',
  encoding 'latin1',
  null ''
);
Our example CSV begins with eight header rows and has more columns than we care to count. When the setup is finished, you can finally query the delimited file directly. The following query will give us the names of the header rows where the first column of the header is GEO.id:
SELECT unnest(x) FROM staging.factfinder_array WHERE x[1] = 'GEO.id';
This next query will give us the first two columns of our data:
SELECT x[1] As geo_id, x[2] As tract_id FROM staging.factfinder_array WHERE x[1] ~ '[0-9]+';
The PostgreSQL FDW, postgres_fdw, is packaged with most distributions of PostgreSQL since PostgreSQL 9.3. This FDW allows you to read from as well as push updates to other PostgreSQL servers, even of different versions.
Start by installing the FDW for the PostgreSQL server in a new database:
CREATE EXTENSION postgres_fdw;
Next, create a foreign server:
CREATE SERVER book_server
FOREIGN DATA WRAPPER postgres_fdw
OPTIONS (host 'localhost', port '5432', dbname 'postgresql_book');
If you need to change or add connection options to the foreign server after creation, you can use the ALTER SERVER command. For example, if you needed to change the server you are pointing to, you could enter:
ALTER SERVER book_server OPTIONS (SET host 'prod');
Changes to connection settings such as the host, port, and database do not take effect until a new session is created. This is because the connection is opened on first use and is kept open.
Next, create a user, mapping its public role to a role on the foreign server:
CREATE USER MAPPING FOR public SERVER book_server
OPTIONS (user 'role_on_foreign', password 'your_password');
The role you map to must exist on the foreign server and have login rights. Anyone who can connect to your database will be able to access the foreign server as well.
Now you are ready to create a foreign table. This table can have a subset of the columns of the table it connects to. In Example 10-3, we create a foreign table that maps to the census.facts table.
CREATE FOREIGN TABLE ft_facts (
  fact_type_id int NOT NULL,
  tract_id varchar(11),
  yr int,
  val numeric(12,3),
  perc numeric(6,2)
)
SERVER book_server
OPTIONS (schema_name 'census', table_name 'facts');
This example includes only the most basic options for the foreign table. By default, all PostgreSQL foreign tables are updatable, unless the remote account you use doesn’t have update access. The updatable setting is a Boolean that can be changed at the foreign table or the foreign server level. For example, to make your table read-only, execute:
ALTER FOREIGN TABLE ft_facts OPTIONS (ADD updatable 'false');
You can set the table back to updatable by running:
ALTER FOREIGN TABLE ft_facts OPTIONS (SET updatable 'true');
The updatable property at the table level overrides the foreign server setting.
In addition to changing OPTIONS, you can also add and drop columns with the ALTER FOREIGN TABLE ... ADD COLUMN and ALTER FOREIGN TABLE ... DROP COLUMN statements.
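For instance, a hypothetical column change against the ft_facts table defined earlier:

```sql
-- drop a column we don't need locally (the remote table is unaffected)
ALTER FOREIGN TABLE ft_facts DROP COLUMN perc;

-- add it back, mapping to the remote column of the same name
ALTER FOREIGN TABLE ft_facts ADD COLUMN perc numeric(6,2);
```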
PostgreSQL 9.5 introduced the IMPORT FOREIGN SCHEMA command, which saves a great deal of time by automatically creating the foreign tables for you. Not all FDWs support IMPORT FOREIGN SCHEMA. Each FDW can also support a custom set of server options when importing. postgres_fdw supports the following custom options:
import_collate
This copies the collation settings from the foreign server for the foreign tables. The default for this setting is true.
import_default
This controls whether default values for columns should be included. The default for the option is false, so columns on the local server have no defaults. But default values are useful during inserts: if you neglect to specify the value of a column, PostgreSQL automatically inserts the default. Be careful, though: the behavior of defaults could be unexpected if you’re relying on a sequence for auto-numbering. The next assigned value from the sequence could differ between the foreign server and the local server.
import_not_null
This controls whether NOT NULL constraints are imported. The default is true.
In Example 10-4, we import all tables in our book database’s public schema.
CREATE SCHEMA remote_census;
IMPORT FOREIGN SCHEMA public
FROM SERVER book_server INTO remote_census
OPTIONS (import_default 'true');
The IMPORT FOREIGN SCHEMA command, as shown in Example 10-4, will create foreign tables with the same names as those in the foreign schema and create them in the designated schema remote_census.
To bring in only a subset of tables, use the LIMIT TO or EXCEPT modifiers. For example, to bring in just the facts and lu_fact_types tables, we could have written:
IMPORT FOREIGN SCHEMA census LIMIT TO (facts, lu_fact_types) FROM SERVER book_server INTO remote_census;
If a table specified in the LIMIT TO does not exist on the remote server, no error will be thrown. You might want to verify after the import that you have all the foreign tables you expected.
A companion clause to LIMIT TO is the EXCEPT clause. Instead of bringing in the tables listed, it brings in the tables not listed.
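For instance, an EXCEPT import mirroring the LIMIT TO example would bring in every table except the ones listed:

```sql
-- import all tables in the census schema except facts
IMPORT FOREIGN SCHEMA census EXCEPT (facts)
FROM SERVER book_server INTO remote_census;
```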
If you take advantage of PostgreSQL extensions, you’ll want to use the extensions foreign server option, a performance enhancement introduced in version 9.6. To utilize it, add the option to an existing postgres_fdw server as we do in the following example:
ALTER SERVER census OPTIONS (ADD extensions 'btree_gist, pg_trgm');
The extensions option is a comma-separated list of extensions installed on the foreign server. When PostgreSQL runs a query involving any of the types or functions defined in the extension in a WHERE clause, it will try to push the function calls to the remote server for improved performance. If the extensions option is not specified, all extension functions will be run locally, which may require transferring more data.
There are many FDWs for querying other relational databases or flat file formats. Most FDWs target a specific kind of data source. For example, you can find the MongoDB FDW for querying MongoDB data, the Hadoop FDW for querying Hadoop data sources, and the MySQL FDW for querying MySQL data sources.
There are two FDWs we are aware of that bundle many formats. Multicorn FDW is really an FDW API that allows you to write your own FDW in Python. There are some ready-made drivers available, but the Multicorn FDW currently has no offering on Windows and is often tricky to get working on Linux.
ogr_fdw is another FDW that supports many formats, and the one we’ll demonstrate in this section. ogr_fdw supports many tabular formats, such as spreadsheets, Dbase files, and CSVs, as well as other relational databases. It is also a spatial database driver that transforms spatial columns from other databases such as SQL Server or Oracle into the PostGIS PostgreSQL spatial geometry type.
Several packages that distribute PostGIS also offer the ogr_fdw extension. For instance, the PostGIS Bundle for Windows found on Stack Builder includes the ogr_fdw extension, ogr_fdw for CentOS/RHEL is available via yum.postgresql.org, and the BigSQL Linux/Mac/Windows PostgreSQL distribution also offers ogr_fdw. If you need or want to compile it yourself, the source for ogr_fdw is on GitHub.
Underneath the hood, ogr_fdw relies on the Geospatial Data Abstraction Library (GDAL) to do the heavy lifting. Therefore, you need to have GDAL compiled and installed before being able to compile or use ogr_fdw. GDAL has undergone quite a few evolutions, and its capabilities vary according to the dependencies it was compiled with. So be warned that your GDAL may not be our GDAL. GDAL is generally installed as part of PostGIS, the spatial extension for PostgreSQL. So to make GDAL use easier, we recommend always installing the latest version of PostGIS.
Many GDAL instances come with support for Excel, LibreOffice Calc, ODBC, and various spatial web services. You will find support for Microsoft Access on Windows, but rarely on Linux/Mac distributions.
After you have installed the ogr_fdw binaries, to enable the ogr_fdw in a particular database, connect to the database and run:
CREATE EXTENSION ogr_fdw;
Foreign servers take on different meanings depending on the type of data source. For example, a folder of CSV files would be considered a server, with each file being a separate table. A Microsoft Excel or LibreOffice Calc workbook would be considered a server, with each sheet in the workbook being a separate table. An SQLite database would be considered a server and each table a foreign table.
The following example links a LibreOffice workbook as a server and corresponding spreadsheets as foreign tables:
CREATE SERVER ogr_fdw_wb
  FOREIGN DATA WRAPPER ogr_fdw
  OPTIONS (datasource '/fdw_data/Budget2015.ods', format 'ODS');
CREATE SCHEMA wb_data;
IMPORT FOREIGN SCHEMA ogr_all
  FROM SERVER ogr_fdw_wb INTO wb_data;
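To confirm which worksheets were linked in as foreign tables, you can query the standard information_schema catalog (the table names returned will depend on the sheets in your workbook):

```sql
-- list the foreign tables imported into the wb_data schema
SELECT foreign_table_schema, foreign_table_name
FROM information_schema.foreign_tables
WHERE foreign_table_schema = 'wb_data';
```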
The ogr_all schema is a catch-all that imports all tables in the foreign server regardless of schema. Some data sources have schemas and some don't. To accommodate all inputs, ogr_fdw accepts (in place of ogr_all) the initial characters of a table name as the schema. So, for example, if you wanted to import just a subset of worksheets whose names begin with “Finance,” you would replace ogr_all with "Finance":
IMPORT FOREIGN SCHEMA "Finance"
  FROM SERVER ogr_fdw_wb INTO wb_data;
The schema is case-sensitive, so if the name of a worksheet contains uppercase characters or nonstandard characters, it needs to be quoted.
This next example will create a server pointing to a folder of CSV files. Create a schema ff to house foreign tables for the CSV server. The FDW will then create foreign tables linked to CSV files where the CSV filename begins with Housing in schema ff:
CREATE SERVER ogr_fdw_ff
  FOREIGN DATA WRAPPER ogr_fdw
  OPTIONS (datasource '/fdw_data/factfinder', format 'CSV');
CREATE SCHEMA ff;
IMPORT FOREIGN SCHEMA "Housing"
  FROM SERVER ogr_fdw_ff INTO ff;
In the preceding example, CSV files named Housing_2015.csv and Housing_2016.csv will be linked in as foreign tables in schema ff, with the names housing_2015 and housing_2016.
ogr_fdw by default launders table names and column names: all uppercase table and column names are converted to lowercase. If you don't want this behavior, you can pass options to IMPORT FOREIGN SCHEMA to keep table and column names as they were in the foreign source. For example:
IMPORT FOREIGN SCHEMA "Housing"
  FROM SERVER ogr_fdw_ff INTO ff
  OPTIONS (launder_table_names 'false', launder_column_names 'false');
This creates the tables with names Housing_2015 and Housing_2016, where the column names of the tables would appear in the same case as they are in the header of the files.
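Once linked, a foreign table can be queried like any ordinary table. A minimal sketch, assuming the Housing CSVs happen to contain columns named Region and Units (hypothetical names for illustration; note the quoting required because laundering was disabled):

```sql
-- "Region" and "Units" are hypothetical column names;
-- quoting preserves the mixed case kept by launder_column_names 'false'
SELECT "Region", SUM("Units"::numeric) AS total_units
FROM ff."Housing_2015"
GROUP BY "Region";
```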
The database world does not appear to be getting more homogeneous. Exotic databases are spawned faster than virile elephants. Some are fads and quickly drown in their own hype. Some aspire to dethrone relational databases altogether. Some could hardly be considered databases. The introduction of FDWs is in part a response to this growing diversity. FDWs assimilate without compromising the PostgreSQL core.
In this next example, we'll demonstrate how to use the www_fdw FDW to query web services. We borrowed the example from www_fdw Examples.
The www_fdw FDW is not generally packaged with PostgreSQL. If you are on Linux/Unix, it's an easy compile if you have the postgresql-dev package installed and can download the latest source. We did the work of compiling for some Windows platforms; you can download our binaries from Windows-32 9.1 FDWs and Windows-64 9.3 FDWs.
Now create the extension to register the FDW:
CREATE EXTENSION www_fdw;
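To verify that the wrapper is now registered in your database, you can check the pg_foreign_data_wrapper system catalog:

```sql
-- lists all foreign data wrappers installed in the current database
SELECT fdwname FROM pg_foreign_data_wrapper;
```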
Then create your Google foreign data server:
CREATE SERVER www_fdw_server_google_search
  FOREIGN DATA WRAPPER www_fdw
  OPTIONS (uri 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0');
The default format supported by www_fdw is JSON, so we didn't need to include it in the OPTIONS modifier. The other supported format is XML. For details on additional parameters that you can set, refer to the www_fdw documentation.
Each FDW is different and comes with its own API settings.
Next, establish at least one user for your FDW. All users that connect to your server should be able to access the Google search server, so here we create one mapping for the entire public group:
CREATE USER MAPPING FOR public SERVER www_fdw_server_google_search;
Now create your foreign table, as shown in Example 10-5. Each field in the table corresponds to a GET parameter in the URL that Google creates for a search.
CREATE FOREIGN TABLE www_fdw_google_search (
  q text,
  GsearchResultClass text,
  unescapedUrl text,
  url text,
  visibleUrl text,
  cacheUrl text,
  title text,
  content text
) SERVER www_fdw_server_google_search;
The user mapping doesn’t assign any rights. You still need to grant rights before being able to query the foreign table:
GRANT SELECT ON TABLE www_fdw_google_search TO public;
Now comes the fun part. We search with the term New in PostgreSQL 10 and mix in a bit of regular expression goodness to strip off HTML tags:
SELECT regexp_replace(title, E'(?x)(< [^>]*? >)', '', 'g') AS title
FROM www_fdw_google_search
WHERE q = 'New in PostgreSQL 10'
LIMIT 2;
        title
-----------------------
 PostgreSQL 10 Roadmap
 PostgreSQL: Roadmap
(2 rows)