Checking replication to ensure availability

One of the core jobs of every administrator is to ensure that replication stays up and running at all times. If replication is down, it is possible that data could be lost if the master crashes. Therefore, keeping an eye on replication is absolutely necessary.

Fortunately, PostgreSQL provides system views, which allow us to take a deep look at what is going on. One of those views is pg_stat_replication:

d pg_stat_replication 
View "pg_catalog.pg_stat_replication"
Column | Type | Collation | Nullable | Default
------------------+--------------------------+-----------+----------+---------
pid | integer | | |
usesysid | oid | | |
usename | name | | |
application_name | text | | |
client_addr | inet | | |
client_hostname | text | | |
client_port | integer | | |
backend_start | timestamp with time zone | | |
backend_xmin | xid | | |
state | text | | |
sent_lsn | pg_lsn | | |
write_lsn | pg_lsn | | |
flush_lsn | pg_lsn | | |
replay_lsn | pg_lsn | | |
write_lag | interval | | |
flush_lag | interval | | |
replay_lag | interval | | |
sync_priority | integer | | |
sync_state | text | | |

The pg_stat_replication view will contain information about the sender. I don't want to use the word master here because slaves can be connected to some other slave. It is possible to build a tree of servers. In the case of a tree of servers, the master will only have information about the slaves it is directly connected to.

The first thing we will see in this view is the process ID of the WAL sender process. It helps us to identify the process in case something goes wrong. This is usually not the case. Then, we will see the username the slave uses to connect to its sending server. The client_* fields will indicate where the slaves are. We will be able to extract network information from those fields. The backend_start field shows when the slaves started to stream from our server.

Then, there is the magical backend_xmin field. Suppose you are running a master/slave setup. It is possible to tell the slave to report its transaction ID to the master. The idea behind this is to delay cleanup on the master so that data is not taken from a transaction running on the slave.

The state field informs us about the state of the server. If our system is fine, the field will contain streaming. Otherwise, closer inspection is needed.

The next four fields are really important. The sent_lsn field, formerly the sent_location field, indicates how much WAL has already reached the other side, which implies that they have been accepted by the WAL receiver. We can use it to figure out how much data has already made it to the slave. Then, there is the write_lsn field, formerly the write_location field. Once WAL has been accepted, it is passed on to the OS. The write_lsn field will tell us that the WAL position has safely made it to the OS already. The flush_lsn field, formerly the flush_location field, will know how much WAL the database has already flushed to disk.

Finally, there is replay_lsn, formerly the replay_location field. The fact that WAL has made it to the disk on the standby does not mean that PostgreSQL has already replayed or been made visible to the end user yet. Suppose that replication is paused. Data will still flow to the standby. However, it will be applied later. The replay_lsn field will tell us how much data is already visible.

In PostgreSQL 10.0, more fields have been added to pg_stat_replication; the *_lag fields indicate the delay of the slave and offer a convenient way to see how far a slave is behind.

The fields are at different intervals so that we can see the time difference directly.

Finally, PostgreSQL tells us whether replication is synchronous or asynchronous.

If we are still on PostgreSQL 9.6, we might find it useful to calculate the difference between the sending and the receiving servers in bytes. The *_lag fields don't do this for 9.6 yet, so having the difference in bytes can be very beneficial. Here's how it works:

SELECT client_addr, pg_current_wal_location() - sent_location AS diff 
FROM pg_stat_replication;

When running this on the master, the pg_current_wal_location() function returns the current transaction log position. PostgreSQL 9.6 has a special datatype for transaction log positions, called pg_lsn. It features a couple of operators, which are used here to subtract the slave's WAL position from the master's WAL position. The view outlined here, therefore, returns the difference between two servers in bytes (replication delay).

Note that this statement only works in PostgreSQL 10. The function used to be called pg_current_xlog_location() in older releases.

While the pg_stat_replication system view contains information on the sending side, the pg_stat_wal_receiver system view will provide us with similar information on the receiving side:

test=# d pg_stat_wal_receiver 
View "pg_catalog.pg_stat_wal_receiver"
Column | Type | Collation | Nullable | Default
-----------------------+--------------------------+-----------+----------+---------
pid | integer | | |
status | text | | |
receive_start_lsn | pg_lsn | | |
receive_start_tli | integer | | |
received_lsn | pg_lsn | | |
received_tli | integer | | |
last_msg_send_time | timestamp with time zone | | |
last_msg_receipt_time | timestamp with time zone | | |
latest_end_lsn | pg_lsn | | |
latest_end_time | timestamp with time zone | | |
slot_name | text | | |
sender_host | text | | |
sender_port | integer | | |
conninfo | text | | |

After the process ID of the WAL receiver process, PostgreSQL will provide you with the status of the process. Then, the receive_start_lsn field will tell you about the transaction log position the WAL receiver started at, while the receive_start_tli field will inform you about the timeline used when the WAL receiver was started.

The received_lsn field contains information about the WAL position, which was already received and flushed to disk. Then, we've got some information about the time, as well as information about slots and connections.

In general, many people find it easier to read the pg_stat_replication system view than the pg_stat_wal_receiver view, and most tools are built around the pg_stat_replication view.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset