State Snapshot Transfer (SST) is one of the two ways used by Galera to perform initial syncing when a node is joining a cluster, until the node is declared as synced and part of the “primary component”. Depending on the dataset size and workload, SST could be lightning fast, or an expensive operation which will bring your database service down on its knees.
SST can be performed using 3 different methods:
- mysqldump
- rsync (or rsync_wan)
- xtrabackup (or xtrabackup-v2, mariabackup)
Most of the time, xtrabackup-v2 and mariabackup are the preferred options. We rarely see people running on rsync or mysqldump in production clusters.
The Problem
When SST is initiated, there are several processes triggered on the joiner node, which are executed by the "mysql" user:
$ ps -fu mysql
UID PID PPID C STIME TTY TIME CMD
mysql 117814 129515 0 13:06 ? 00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role donor --address 192.168.55.173:4444/xtrabackup_sst//1 --socket /var/lib/mysql/mysql.sock --datadir
mysql 120036 117814 15 13:06 ? 00:00:06 innobackupex --no-version-check --tmpdir=/tmp/tmp.pMmzIlZJwa --user=backupuser --password=x xxxxxxxxxxxxxx --socket=/var/lib/mysql/mysql.sock --galera-inf
mysql 120037 117814 19 13:06 ? 00:00:07 socat -u stdio TCP:192.168.55.173:4444
mysql 129515 1 1 Oct27 ? 01:11:46 /usr/sbin/mysqld --wsrep_start_position=7ce0e31f-aa46-11e7-abda-56d6a5318485:4949331
While on the donor node:
mysql 43733 1 14 Oct16 ? 03:28:47 /usr/sbin/mysqld --wsrep-new-cluster --wsrep_start_position=7ce0e31f-aa46-11e7-abda-56d6a5318485:272891
mysql 87092 43733 0 14:53 ? 00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup-v2 --role donor --address 192.168.55.172:4444/xtrabackup_sst//1 --socket /var/lib/mysql/mysql.sock --datadir /var/lib/mysql/ --gtid 7ce0e31f-aa46-11e7-abda-56d6a5318485:2883115 --gtid-domain-id 0
mysql 88826 87092 30 14:53 ? 00:00:05 innobackupex --no-version-check --tmpdir=/tmp/tmp.LDdWzbHkkW --user=backupuser --password=x xxxxxxxxxxxxxx --socket=/var/lib/mysql/mysql.sock --galera-info --stream=xbstream /tmp/tmp.oXDumYf392
mysql 88827 87092 30 14:53 ? 00:00:05 socat -u stdio TCP:192.168.55.172:4444
SST against a large dataset (hundreds of GBytes) is no fun. Depending on the hardware, network and workload, it may take hours to complete. Server resources may be saturated during the operation. Despite throttling is supported in SST (only for xtrabackup and mariabackup) using --rlimit and --use-memory options, we are still exposed to a degraded cluster when you are running out of majority active nodes. For example, if you are unlucky enough to find yourself with only one out of three nodes running. Therefore, you are advised to perform SST during quiet hours. You can, however, avoid SST by taking some manual steps, as described in this blog post.
Stopping an SST
Stopping an SST needs to be done on both the donor and the joiner nodes. The joiner triggers SST after determining how big the gap is when comparing the local Galera seqno with cluster's seqno. It executes the wsrep_sst_{wsrep_sst_method} command. This will be picked by the chosen donor, which will start streaming out data to the joiner. A donor node has no capabilities of refusing to serve snapshot transfer, once selected by Galera group communication, or by the value defined in wsrep_sst_donor variable. Once the syncing has started and you want to revert the decision, there is no single command to stop the operation.
The basic principle when stopping an SST is to:
- Make the joiner look dead from a Galera group communication point-of-view (shutdown, fence, block, reset, unplug cable, blacklist, etc)
- Kill the SST processes on the donor
One would think that killing the innobackupex process (kill -9 {innobackupex PID}) on the donor would be enough, but that is not the case. If you kill the SST processes on donor (or joiner) without fencing off the joiner, Galera still can see the joiner as active and will mark the SST process as incomplete, thus respawning a new set of processes to continue or start over again. You will be back to square one. This is the expected behaviour of /usr/bin/wsrep_sst_{method} script to safeguard SST operation which is vulnerable to timeouts (e.g., if it is long-running and resource intensive).
Let's look at an example. We have a crashed joiner node that we would like to rejoin the cluster. We would start by running the following command on the joiner:
$ systemctl start mysql # or service mysql start
A minute later, we found out that the operation is too heavy at that particular moment, and decided to postpone it later during low traffic hours. The most straightforward way to stop an xtrabackup-based SST method is by simply shutting down the joiner node, and kill the SST-related processes on the donor node. Alternatively, you can also block the incoming ports on the joiner by running the following iptables command on the joiner:
$ iptables -A INPUT -p tcp --dport 4444 -j DROP
$ iptables -A INPUT -p tcp --dport 4567:4568 -j DROP
Then on the donor, retrieve the PID of SST processes (list out the processes owned by "mysql" user):
$ ps -u mysql
PID TTY TIME CMD
117814 ? 00:00:00 wsrep_sst_xtrab
120036 ? 00:00:06 innobackupex
120037 ? 00:00:07 socat
129515 ? 01:11:47 mysqld
Finally, kill them all except the mysqld process (you must be extremely careful to NOT kill the mysqld process on the donor!):
$ kill -9 117814 120036 120037
Then, on the donor MySQL error log, you should notice the following line appearing after ~100 seconds:
2017-10-30 13:24:08 139722424837888 [Warning] WSREP: Could not find peer: 42b85e82-bd32-11e7-87ae-eff2b8dd2ea0
2017-10-30 13:24:08 139722424837888 [Warning] WSREP: 1.0 (192.168.55.172): State transfer to -1.-1 (left the group) failed: -32 (Broken pipe)
At this point, the donor should return to the "synced" state as reported by wsrep_local_state_comment and the SST process is completely stopped. The donor is back to its operational state and is able to serve clients in full capacity.
For the cleanup process on the joiner, you can simply flush the iptables chain:
$ iptables -F
Or simply remove the rules with -D flag:
$ iptables -D INPUT -p tcp --dport 4444 -j DROP
$ iptables -D INPUT -p tcp --dport 4567:4568 -j DROP
The similar approach can be used with other SST methods like rsync, mariabackup and mysqldump.
Throttling an SST (xtrabackup method only)
Depending on how busy the donor is, it's a good approach to throttle the SST process so it won't impact the donor significantly. We've seen a number of cases where, during catastrophic failures, users were desperate to bring back a failed cluster as a single bootstrapped node, and let the rest of the members catch up later. This attempt reduces the downtime from the application side, however, it creates additional burden on this “one-node cluster”, while the remaining members are still down or recovering.
Xtrabackup can be throttled with --throttle=<rate of IO/sec> to simply limit the number of IO operation if you are afraid that it will saturate your disks, but this option is only applicable when running xtrabackup as a backup process, not as an SST operator. Similar options are available with rlimit (rate limit) and can be combined with --use-memory to limit the RAM usage. By setting up values under [sst] directive inside the MySQL configuration file, we can ensure that the SST operation won't put too much load on the donor, even though it can take longer to complete. On the donor node, set the following:
[sst]
rlimit=128k
inno-apply-opts="--use-memory=200M"
More details on the Percona Xtrabackup SST documentation page.
However, there is a catch. The process could be so slow that it will never catch up with the transaction logs that InnoDB is writing, so SST might never complete. Generally, this situation is very uncommon, unless if you really have a very write-intensive workload or you allocate very limited resources to SST.
Conclusions
SST is critical but heavy, and could potentially be a long-running operation depending on the dataset size and network throughput between the nodes. Regardless of the consequences, there are still possibilities to stop the operation so we can have a better recovery plan at a better time.