Given the goals set for this chapter, the last topic to be done is the customer's complaint about replication, which tends to break from time to time. In PostgreSQL, replication never breaks for technical reasons; only misconfiguration can lead to unexpected behavior. Here are the most common issues:
Let's attack one problem at a time.
One of the most common problems is that at some point, replication just stops. The slave will start to fall behind, and the result coming back from a slave will be old and outdated. There are a couple of reasons for this kind of behavior:
In many cases, bandwidth is a serious issue. If two servers are connected using a 100 MB interconnect, it can easily happen that the slave falls behind and eventually dies if the write load is constantly higher than 10 MB per second. Keep in mind that the master only keeps a couple of xlog
files around, which are needed to recover in case of a crash of the master. In the default configuration, it does not queue transaction logs forever just to handle problems on the slaves. If the load is too high on a systematic basis, this would not help anyway.
In PostgreSQL, the slave requests the transaction log it needs from the master. If the master does not have the desired xlog
file any more, the slave cannot continue. To many users, this looks like a replication failure, which it really is not. It is clearly desired behavior. To reduce the problem straight away, two options are available:
wal_keep_segments
xlog
)In general, it is always recommended to use wal_keep_segments
, unless replication slots are used. Otherwise, it is too risky and a slave is likely to fall much behind its master. When configuring wal_keep_segments
, be gracious and don't use settings that are too low. A nice xlog
retention pool can really pay off.
Replication slots are a nice alternative to wal_keep_segments
. In the case of a normal setup using wal_keep_segments
, a master will keep as much xlog
as configured by the administrator. But how can the administrator know this? Usually, there is no way to know.
A replication slot can help because it keeps xlog
around as long as it is needed. Here is how it works:
postgres=# SELECT * FROM pg_create_physical_replication_slot('rep_slot'), slot_name | xlog_position -------------+--------------- rep_slot |
It is easy to check for existing replication slots:
postgres=# SELECT slot_name, slot_type FROM pg_replication_slots; slot_name | slot_type -------------+----------- rep_slot | physical (1 row)
In the replica, the name of the replication slot has to be added to recovery.conf
:
primary_slot_name = 'rep_slot'
Another important issue is related to failed queries. It can happen that queries are cancelled on the slave due to replication conflicts. If you are working on a system that constantly kicks queries, it can be very helpful to check out hot_standby_feedback
.
The idea is to let the slave report its oldest transaction ID from time to time. The master can then react and ensure that conflicts are prevented by delaying VACUUM
on itself. In reality, setting hot_standby_feedback
on the slave leaves us with the same behavior as if the query had actually run on the master (not performance-wise, of course).
If the people working on those slave systems can be trusted, it usually makes sense to turn hot_standby_feedback
on. If users on the slave are not considered to be competent enough, it is safer to stick to the default value—off.