PostgreSQL Replication

There are plenty of tutorials available online, which give step-by-step instructions to manage the replication of PostgreSQL clusters with Repmgr. Once the setup is complete, what is important to look for? How does your client handle failover? How to deal with a failed or unreachable standby node? How to reintegrate a failed or unreachable standby node? How do you monitor your cluster?

Client Failover

When your primary node goes down and a standby is promoted. How do you let your clients handle this change?

Official driver

A simple solution is to use official driver capabilities. As mentioned in the documentation, you can list all nodes in your cluster and the connection will be established only on the primary node.

jdbc:postgresql://primary,standby1,standby2/database?targetServerType=master

In this situation, the primary node will handle all the read and write queries. However, it’s possible to create a second connection dedicated to read queries on standby nodes.

jdbc:postgresql://primary,standby1,standby2/database?targetServerType=preferSlave

Handmade solution

It’s possible to create your own solution to enable connection failover. For example in this article, a shell script is used to check PostgreSQL status and HAProxy is used to perform failover.

Repmgrd

Repmgrd is the Repmgr daemon. It monitors the PostgreSQL cluster and performs necessary actions based on the state of the cluster. It performs automatic failover in the case that the primary node goes down by promoting the most eligible standby as the new primary.

Repmgrd is a critical process that should be running at all times. The failover mechanism would not be able to kick in if it were to stop working. Therefore, it is highly recommended to enable “auto restart” by overriding the provided daemon’s configuration file with systemd.

.include /lib/systemd/system/repmgr.service
[Service]
Restart=always
RestartSec=10
StartLimitInterval=60
StartLimitBurst=3

With this configuration, the daemon will be restarted in case of failure after a timeout of 10 seconds. It will try 3 times during an interval of 60 seconds. In any case, if Repmgr daemon is going down, it means there is a problem with the service or on the server and simply restarting the service may not fix the issue.

Promotion

Promoting a new primary is one of the most important actions during a failover situation. Repmgr knows when by doing reconnect attempts. Depending on the quality of your network, you may need to avoid promotion caused by network latency by modifying reconnect_attempts and reconnect_interval

# Number of attempts which will be made to reconnect to an unreachable primary (or other upstream node)
reconnect_attempts=6                                                          

# Interval between attempts to reconnect to an unreachable primary (or other upstream node)
reconnect_interval=10

With this configuration, a total of 6 attempts will be made with 10 seconds between attempts before promoting a new primary using promote_command.

As you can see from the documentation, promote_command in Repmgr configuration is used in a failover situation to promote a new primary. Example:

promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'

In fact, in the case of a failover situation, something wrong is happening so it may be a good action to perform a backup in addition to promotion. For example, this command will perform a backup using pgbackrest.

promote_command='/usr/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file && sleep 120 && pgbackrest --stanza=my_stanza --type=full backup'

Monitoring

PostgreSQL Replication statistics are available on current primary. An overall vision can be seen on primary node by executing:

select * from pg_stat_replication;

pg_stat_replication contains statistics about each WAL sender process connected to a standby process.

-[ RECORD 1 ]----+----------------------------------------------
pid              | 11881
usesysid         | 16388
usename          | repmgr
application_name | <application_name>
client_addr      | <ip>
client_hostname  | 
client_port      | 58212
backend_start    | 2019-11-12 16:52:04.51763+01
backend_xmin     | 
state            | streaming
sent_lsn         | 138/6C000000
write_lsn        | 138/6C000000
flush_lsn        | 138/6C000000
replay_lsn       | 138/6C000000
write_lag        | 
flush_lag        | 00:00:00.29105
replay_lag       | 
sync_priority    | 0
sync_state       | async
-[ RECORD 2 ]----+----------------------------------------------
pid              | 11879
usesysid         | 16388
usename          | repmgr
application_name | <application_name>
client_addr      | <ip>
client_hostname  | 
client_port      | 35170
backend_start    | 2019-11-12 16:52:03.053909+01
backend_xmin     | 
state            | streaming
sent_lsn         | 138/6C000000
write_lsn        | 138/6C000000
flush_lsn        | 138/6C000000
replay_lsn       | 138/6C000000
write_lag        | 
flush_lag        | 00:00:00.268068
replay_lag       | 
sync_priority    | 0
sync_state       | async
-[ RECORD 3 ]----+----------------------------------------------
pid              | 5201
usesysid         | 16388
usename          | repmgr
application_name | <application_name>
client_addr      | <ip>
client_hostname  | 
client_port      | 52130
backend_start    | 2019-11-08 16:22:01.688119+01
backend_xmin     | 
state            | streaming
sent_lsn         | 138/6C000000
write_lsn        | 138/6C000000
flush_lsn        | 138/6C000000
replay_lsn       | 138/6C000000
write_lag        | 
flush_lag        | 00:00:00.23605
replay_lag       | 
sync_priority    | 0
sync_state       | async

First of all, sync_state column indicates what type of replication used. In this example, the value async only shows that is an asynchronous replication and it’s definitely not a problem as I have seen explained in some articles. When you setup a replication cluster, you had to choose between asynchronous and synchronous replication.

In postgresql.conf file, it’s possible to specify a list of standby servers to support synchronous replication with synchronous_standby_names. It means that transactions will commit only after standby synchronous servers confirm receipt of the data.

What is important to look is state column. Value streaming indicates the WAL sender is streaming changes to its connected standby server. Then, an important health indicator is the amount of WAL records generated the primary node, but not yet applied by standby server: streaming lag.

sent_lsn shows the last WAL sent on WAL sender connection. LSN stands for Log Sequence Number and it’s a position in the Write-Ahead Log stream. Thus, an important gap between current WAL and sent_lsn may indicate that primary server is under heavy load. We can get current WAL by using pg_current_wal_lsn function and pg_wal_lsn_diff to compute the difference.

select (pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn)) as primary_streaming_lag FROM pg_stat_replication;

The result is in bytes.

The view pg_stat_replication provides also the last WAL position written, flushed and replayed on standby database. This way, we can have the standby total streaming lag by executing this query for each standby node :

select (pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn)) as standby_streaming_lag FROM pg_stat_replication;

select application_name,
pg_current_wal_lsn() as current_WAL_lsn,
sent_lsn as last_sent_WAL,
replay_lsn last_replay_WAL,
(pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn))::int / 1024 as primary_streaming_lag,
(pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn))::int / 1024 as standby_streaming_lag
FROM pg_stat_replication;

Let’s check this query by adding some network latency

tc qdisc add dev eth0 root handle 1: prio priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tc qdisc add dev eth0 parent 1:2 handle 20: netem delay 1000ms
tc filter add dev eth0 parent 1:0 protocol ip u32 match ip dst <ip mask> flowid 1:2

This command adds a delay of 1 second for nodes that match a specific IP range.

select application_name,
pg_current_wal_lsn() as current_WAL_lsn,
sent_lsn as last_sent_WAL,
replay_lsn last_replay_WAL,
(pg_wal_lsn_diff(pg_current_wal_lsn(),sent_lsn))::int / 1024 as primary_streaming_lag,
(pg_wal_lsn_diff(pg_current_wal_lsn(),replay_lsn))::int / 1024 as standby_streaming_lag
FROM pg_stat_replication;

-[ RECORD 1 ]---------+----------------------------------------------
application_name      | <application_name>
current_wal_lsn       | 138/7D054AF8
last_sent_wal         | 138/7D054AF8
last_replay_wal       | 138/7D054AF8
primary_streaming_lag | 0
standby_streaming_lag | 0
-[ RECORD 2 ]---------+----------------------------------------------
application_name      | <application_name>
current_wal_lsn       | 138/7D054AF8
last_sent_wal         | 138/7D054AF8
last_replay_wal       | 138/7D000000
primary_streaming_lag | 0
standby_streaming_lag | 338
-[ RECORD 3 ]---------+----------------------------------------------
application_name      | <application_name>
current_wal_lsn       | 138/7D054AF8
last_sent_wal         | 138/7D054AF8
last_replay_wal       | 138/7D000000
primary_streaming_lag | 0
standby_streaming_lag | 338

primary_streaming_lag and standby_streaming_lag values are in KB as it’s divided by 1024. This is nice but it’s only available on primary node. If we are experiencing lag on a standby node, it can mean that this node is under heavy load or network latency. It’s important to monitor replication directly on standby nodes. And so, Repmgr has its own table to monitor replication.

select * from repmgr.replication_status;

-[ RECORD 1 ]-------------+----------------------------------------------
primary_node_id           | 1
standby_node_id           | 3
standby_name              | <application_name>
node_type                 | standby
active                    | t
last_monitor_time         | 2019-11-22 11:07:51.673005+01
last_wal_primary_location | 143/4C009390
last_wal_standby_location | 143/4C009390
replication_lag           | 0 bytes
replication_time_lag      | 00:00:00
apply_lag                 | 0 bytes
communication_time_lag    | 00:00:00.960348
-[ RECORD 2 ]-------------+----------------------------------------------
primary_node_id           | 1
standby_node_id           | 4
standby_name              | <application_name>
node_type                 | standby
active                    | t
last_monitor_time         | 2019-11-22 11:07:51.100838+01
last_wal_primary_location | 143/4C0091D0
last_wal_standby_location | 143/4C0091D0
replication_lag           | 0 bytes
replication_time_lag      | 00:00:00
apply_lag                 | 0 bytes
communication_time_lag    | 00:00:01.532515
-[ RECORD 3 ]-------------+----------------------------------------------
primary_node_id           | 1
standby_node_id           | 2
standby_name              | <application_name>
node_type                 | standby
active                    | t
last_monitor_time         | 2019-11-22 11:07:51.310996+01
last_wal_primary_location | 143/4C0092B0
last_wal_standby_location | 143/4C0092B0
replication_lag           | 0 bytes
replication_time_lag      | 00:00:00
apply_lag                 | 0 bytes
communication_time_lag    | 00:00:01.322357

This view is created and available on each node, primary and standby. You can check how this view is built on Repmgr Github, depends your Repmgr version.

Furthermore, Repmgr offers a check service by node.

/usr/bin/repmgr -f /etc/repmgr.conf node check

Node "<application_name>":
	Server role: OK (node is standby)
	Replication lag: OK (0 seconds)
	WAL archiving: OK (10 pending archive ready files)
	Downstream servers: OK (this node has no downstream nodes)
	Replication slots: OK (node has no physical replication slots)
	Missing physical replication slots: OK (node has no missing physical replication slots)
	Configured data directory: OK (configured "data_directory" is "<data_directory>")

Replication lag checks if the node is lagging by more than replication_lag_warning and replication_lag_critical parameters. By default, it’s 300 and 600 ms.

A big thanks to Karen and Marc Barret for their time and proofreading.

Photo by Jan Kolar / VUI Designer on Unsplash