High Availability Cluster Testing
In an earlier post, we learned how to configure High Availability for SAP HANA.
Now we need to test whether our configuration works as expected.
Auto-failover – in this method you deploy an additional host for the HANA database and configure it to work in standby mode. If the active node fails, the standby host automatically takes over operations. This solution requires shared storage.
System replication – in this solution you install a separate HANA system and configure replication of data changes. On its own, system replication does not provide full High Availability, because the HANA database does not perform an automatic failover.
However, you can use the features of SUSE Linux to enhance the base solution.
Testing the Cluster
The following test descriptions assume the parameters PREFER_SITE_TAKEOVER="true" and AUTOMATED_REGISTER="false".
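While running the tests, it helps to watch the cluster and replication state from a second terminal. A minimal sketch, assuming the SAPHanaSR package is installed (both commands run as root on either node):
HANAPRD# crm_mon -r
HANAPRD# SAPHanaSR-showAttr
crm_mon -r shows the resource states including inactive resources, and SAPHanaSR-showAttr prints the replication attributes, including the sync_state values (SOK and SFAIL) referenced below.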
TEST 1: STOP PRIMARY DATABASE ON NODE 1
The primary HANA database is stopped during normal cluster operation.
- TEST PROCEDURE
Stop the primary HANA database gracefully as hspadm.
HANAPRD# HDB stop
- RECOVERY PROCEDURE
Manually register the old primary (on node 1) with the new primary after takeover (on node 2) as hspadm.
HANAPRD# hdbnsutil -sr_register --remoteHost=HANAPRDSHD --remoteInstance=02 --replicationMode=sync --name=NODEA
- Restart the HANA database (now secondary) on node 1 as root.
HANAPRD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HANAPRD
Expected:
- The cluster detects the stopped primary HANA database (on node 1) and marks the resource failed.
- The cluster promotes the secondary HANA database (on node 2) to take over as primary.
- The cluster migrates the IP address to the new primary (on node 2).
- After some time, the cluster shows the sync_state of the stopped primary (on node 1) as SFAIL.
- Because AUTOMATED_REGISTER="false" the cluster does not restart the failed HANA database or register it against the new primary.
- After the manual register and resource cleanup the system replication pair is marked as in sync (SOK).
- The cluster “failed actions” are cleaned up after following the recovery procedure.
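To confirm that the replication pair is really back in sync after the recovery, you can also query HANA directly; a quick check, run as hspadm on the new primary (node 2):
HANAPRDSHD# hdbnsutil -sr_state
The same check applies after the recovery procedures of the tests below.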
TEST 2: STOP PRIMARY DATABASE ON NODE 2
The primary HANA database is stopped during normal cluster operation.
- TEST PROCEDURE
Stop the database gracefully as hspadm.
HANAPRDSHD# HDB stop
- RECOVERY PROCEDURE
Manually register the old primary (on node 2) with the new primary after takeover (on node 1) as hspadm.
HANAPRDSHD# hdbnsutil -sr_register --remoteHost=HANAPRD --remoteInstance=02 --replicationMode=sync --name=NODEB
- Restart the HANA database (now secondary) on node 2 as root.
HANAPRDSHD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HANAPRDSHD
Expected:
- The cluster detects the stopped primary HANA database (on node 2) and marks the resource failed.
- The cluster promotes the secondary HANA database (on node 1) to take over as primary.
- The cluster migrates the IP address to the new primary (on node 1).
- After some time, the cluster shows the sync_state of the stopped primary (on node 2) as SFAIL.
- Because AUTOMATED_REGISTER="false" the cluster does not restart the failed HANA database or register it against the new primary.
- After the manual register and resource cleanup the system replication pair is marked as in sync (SOK).
- The cluster “failed actions” are cleaned up after following the recovery procedure.
TEST 3: CRASH PRIMARY DATABASE ON NODE 1
Simulate a complete breakdown of the primary database system.
- TEST PROCEDURE
Kill the primary database system using signals as hspadm.
HANAPRD# HDB kill-9
- RECOVERY PROCEDURE
Manually register the old primary (on node 1) with the new primary after takeover (on node 2) as hspadm.
HANAPRD# hdbnsutil -sr_register --remoteHost=HANAPRDSHD --remoteInstance=02 --replicationMode=sync --name=NODEA
- Restart the HANA database (now secondary) on node 1 as root.
HANAPRD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HANAPRD
Expected:
- The cluster detects the stopped primary HANA database (on node 1) and marks the resource failed.
- The cluster promotes the secondary HANA database (on node 2) to take over as primary.
- The cluster migrates the IP address to the new primary (on node 2).
- After some time, the cluster shows the sync_state of the stopped primary (on node 1) as SFAIL.
- Because AUTOMATED_REGISTER="false" the cluster does not restart the failed HANA database or register it against the new primary.
- After the manual register and resource cleanup the system replication pair is marked as in sync (SOK).
- The cluster “failed actions” are cleaned up after following the recovery procedure.
TEST 4: CRASH PRIMARY DATABASE ON NODE 2
Simulate a complete breakdown of the primary database system.
- TEST PROCEDURE
Kill the primary database system using signals as hspadm.
HANAPRDSHD# HDB kill-9
- RECOVERY PROCEDURE
Manually register the old primary (on node 2) with the new primary after takeover (on node 1) as hspadm.
HANAPRDSHD# hdbnsutil -sr_register --remoteHost=HANAPRD --remoteInstance=02 --replicationMode=sync --name=NODEB
- Restart the HANA database (now secondary) on node 2 as root.
HANAPRDSHD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HANAPRDSHD
Expected:
- The cluster detects the stopped primary HANA database (on node 2) and marks the resource failed.
- The cluster promotes the secondary HANA database (on node 1) to take over as primary.
- The cluster migrates the IP address to the new primary (on node 1).
- After some time, the cluster shows the sync_state of the stopped primary (on node 2) as SFAIL.
- Because AUTOMATED_REGISTER="false" the cluster does not restart the failed HANA database or register it against the new primary.
- After the manual register and resource cleanup the system replication pair is marked as in sync (SOK).
- The cluster “failed actions” are cleaned up after following the recovery procedure.
TEST 5: CRASH PRIMARY SITE NODE (NODE 1)
Simulate a crash of the primary site node running the primary HANA database.
TEST PROCEDURE
Crash the primary node by sending a ‘fast-reboot’ system request.
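A common way to issue such a request is the kernel's magic SysRq trigger, which reboots the node instantly without a clean shutdown; a sketch, run as root on node 1 (verify that SysRq is available on your kernel before relying on it):
HANAPRD# echo b > /proc/sysrq-trigger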
RECOVERY PROCEDURE
- If SBD fencing is used, Pacemaker will not automatically restart after the node is fenced. In this case, clear the fencing flag on all SBD devices and then start Pacemaker.
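For example, the fencing flag can be cleared with the sbd tool; the device path below is a placeholder, use the device(s) listed in /etc/sysconfig/sbd:
HANAPRD# sbd -d /dev/disk/by-id/<SBD-device> message HANAPRD clear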
HANAPRD# systemctl start pacemaker
- Manually register the old primary (on node 1) with the new primary after takeover (on node 2) as hspadm.
HANAPRD# hdbnsutil -sr_register --remoteHost=HANAPRDSHD --remoteInstance=02 --replicationMode=sync --name=NODEA
Restart the HANA database (now secondary) on node 1 as root.
HANAPRD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HANAPRD
Expected:
- The cluster detects the failed node (node 1), declares it UNCLEAN, and sets the secondary node (node 2) to status "partition WITHOUT quorum".
- The cluster fences the failed node (node 1).
- The cluster declares the failed node (node 1) OFFLINE.
- The cluster promotes the secondary HANA database (on node 2) to take over as primary.
- The cluster migrates the IP address to the new primary (on node 2).
- After some time, the cluster shows the sync_state of the stopped primary (on node 1) as SFAIL.
- If SBD fencing is used, the manual recovery procedure is needed to clear the fencing and restart Pacemaker on the node.
- Because AUTOMATED_REGISTER="false" the cluster does not restart the failed HANA database or register it against the new primary.
- After the manual register and resource cleanup, the system replication pair is marked as in sync (SOK).
- The cluster "failed actions" are cleaned up after following the recovery procedure.
TEST 6: CRASH SECONDARY SITE NODE (NODE 2)
Simulate a crash of the secondary site node running the primary HANA database.
TEST PROCEDURE
Crash the secondary node by sending a ‘fast-reboot’ system request.
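As in TEST 5, the magic SysRq trigger can be used; a sketch, run as root on node 2:
HANAPRDSHD# echo b > /proc/sysrq-trigger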
RECOVERY PROCEDURE
- If SBD fencing is used, Pacemaker will not automatically restart after the node is fenced. In this case, clear the fencing flag on all SBD devices and then start Pacemaker.
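As in TEST 5, clear the fencing flag with the sbd tool first; the device path is a placeholder for your configured SBD device(s):
HANAPRDSHD# sbd -d /dev/disk/by-id/<SBD-device> message HANAPRDSHD clear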
HANAPRDSHD# systemctl start pacemaker
- Manually register the old primary (on node 2) with the new primary after takeover (on node 1) as hspadm.
HANAPRDSHD# hdbnsutil -sr_register --remoteHost=HANAPRD --remoteInstance=02 --replicationMode=sync --name=NODEB
Restart the HANA database (now secondary) on node 2 as root.
HANAPRDSHD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HANAPRDSHD
Expected:
- The cluster detects the failed secondary site node (node 2), declares it UNCLEAN, and sets the primary site node (node 1) to status "partition WITHOUT quorum".
- The cluster fences the failed secondary site node (node 2).
- The cluster declares the failed secondary site node (node 2) OFFLINE.
- The cluster promotes the secondary HANA database (on node 1) to take over as primary.
- The cluster migrates the IP address to the new primary (on node 1).
- After some time, the cluster shows the sync_state of the stopped primary (on node 2) as SFAIL.
- If SBD fencing is used, the manual recovery procedure is needed to clear the fencing and restart Pacemaker on the node.
- Because AUTOMATED_REGISTER="false" the cluster does not restart the failed HANA database or register it against the new primary.
- After the manual register and resource cleanup, the system replication pair is marked as in sync (SOK).
- The cluster "failed actions" are cleaned up after following the recovery procedure.
TEST 7: STOP THE SECONDARY DATABASE ON NODE 2
The secondary HANA database is stopped during normal cluster operation.
TEST PROCEDURE
Stop the secondary HANA database gracefully as hspadm.
HANAPRDSHD# HDB stop
RECOVERY PROCEDURE
Clean up the failed resource status of the secondary HANA database (on node 2) as root.
HANAPRDSHD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HANAPRDSHD
Expected:
- The cluster detects the stopped secondary database (on node 2) and marks the resource failed.
- The cluster detects the broken system replication and marks it as failed (SFAIL).
- The cluster restarts the secondary HANA database on the same node (node 2).
- The cluster detects that the system replication is in sync again and marks it as ok (SOK).
- The cluster "failed actions" are cleaned up after following the recovery procedure.
TEST 8: CRASH THE SECONDARY DATABASE ON NODE 2
Simulate a complete breakdown of the secondary database system.
TEST PROCEDURE
Kill the secondary database system using signals as hspadm.
HANAPRDSHD# HDB kill-9
RECOVERY PROCEDURE
Clean up the failed resource status of the secondary HANA database (on node 2) as root.
HANAPRDSHD# crm resource cleanup rsc_SAPHana_HSP_HDB02 HANAPRDSHD
Expected:
- The cluster detects the stopped secondary database (on node 2) and marks the resource failed.
- The cluster detects the broken system replication and marks it as failed (SFAIL).
- The cluster restarts the secondary HANA database on the same node (node 2).
- The cluster detects that the system replication is in sync again and marks it as ok (SOK).
- The cluster "failed actions" are cleaned up after following the recovery procedure.
TEST 9: FAILURE OF REPLICATION LAN
Loss of replication LAN connectivity between the primary and secondary node.
TEST PROCEDURE
Break the connection between the cluster nodes on the replication LAN.
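One way to simulate the failure, assuming a dedicated replication interface (eth1 below is a placeholder for your replication NIC), is to take the interface down as root:
HANAPRD# ip link set eth1 down
Bringing it back up with ip link set eth1 up re-establishes the connection for the recovery step.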
RECOVERY PROCEDURE
Re-establish the connection between the cluster nodes on the replication LAN.
Expected:
- After some time, the cluster shows the sync_state of the secondary (on node 2) as SFAIL.
- On the primary HANA database (node 1), "HDBSettings.sh systemReplicationStatus.py" shows "CONNECTION TIMEOUT", and the secondary HANA database (node 2) is not able to reach the primary database (node 1); see the example after this list.
- The primary HANA database continues to operate as "normal", but no system replication takes place, so the secondary is no longer a valid takeover destination.
- Once the LAN connection is re-established, HDB automatically detects connectivity between the HANA databases and restarts the system replication process.
- The cluster detects that the system replication is in sync again and marks it as ok (SOK).
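To run the status check quoted above by hand, execute the script as hspadm on the primary; the path below assumes the standard installation layout for SID HSP and instance 02:
HANAPRD# su - hspadm -c "/usr/sap/HSP/HDB02/HDBSettings.sh systemReplicationStatus.py"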
Using Maintenance Mode
Every now and then, you need to perform upgrade or maintenance tasks on individual cluster components or the whole cluster: changing the cluster configuration, updating software packages for individual nodes, or upgrading the cluster to a higher product version.
Using Maintenance Mode via Hawk
Applying Maintenance Mode to Nodes
Sometimes it is necessary to put single nodes into maintenance mode.
Start a Web browser and log in to the cluster as described in Starting Hawk and Logging In.
On the Home page, select the Nodes tab.
In the individual node's view, click the options next to the node and switch it to Maintenance.
This adds the following instance attribute to the node: maintenance="on". The resources previously running on the maintenance-mode node become unmanaged, and no new resources are allocated to the node until it leaves maintenance mode.
After you have finished, remove the maintenance mode to resume normal cluster operation by setting the node back to Ready:
Start a Web browser and log in to the cluster as described in Starting Hawk and Logging In.
On the Home page, select the Nodes tab.
In the individual node's view, click the options next to the node and switch Maintenance off.
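If you prefer the command line over Hawk, crmsh offers equivalent commands; for example, as root (the node name is illustrative):
HANAPRD# crm node maintenance HANAPRD
HANAPRD# crm node ready HANAPRD
The first command switches the node into maintenance mode; the second returns it to normal operation.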