A Hitch with SEA Failover Testing

Originally posted October 17, 2017 on AIXchange

A few months back, I ran into an issue during shared Ethernet adapter (SEA) failover testing. After upgrading to VIO server 2.2.5.10, we would fail VIOS1 and verify our disks and networks were functioning as expected on the VIO clients. Then we’d bring VIOS1 back online and fail VIOS2. The network would hang on the VIO clients.

When we checked the status of our SEAs on VIOS1, they would show up as “unhealthy.” The only way we could resolve this was to reboot the VIO server. This was unexpected behavior and not the way failover used to work.

Eventually we found that we could lower the value of the health_time_req attribute so that it would time out sooner:

Health Time (health_time_req)
Sets the time that is required to elapse before a system is considered “healthy” after a system failover. After a Shared Ethernet Adapter moves to an “unhealthy” state, the Health Time attribute specifies an integer that indicates the number of seconds for which the system must maintain a “healthy” state before it is allowed to return into the Shared Ethernet Adapter protocol. The default value is 600 seconds.
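To make the timing concrete, here is a minimal Python sketch of the rule as that description words it. It is a toy model, not the actual SEA code: the class name SeaHealthTimer and its methods are hypothetical, and the only figure taken from the documentation is the 600-second default.

```python
from dataclasses import dataclass

# Default quoted in the attribute description above, in seconds.
DEFAULT_HEALTH_TIME_REQ = 600


@dataclass
class SeaHealthTimer:
    """Toy model of the Health Time rule (hypothetical, for illustration only)."""

    health_time_req: int = DEFAULT_HEALTH_TIME_REQ

    def seconds_remaining(self, healthy_for: int) -> int:
        """Seconds the adapter must still stay healthy before it may rejoin."""
        return max(0, self.health_time_req - healthy_for)

    def may_rejoin(self, healthy_for: int) -> bool:
        """True once the adapter has stayed healthy for the required time."""
        return self.seconds_remaining(healthy_for) == 0


if __name__ == "__main__":
    # Five minutes after recovering, compare the default with a lowered value.
    for setting in (DEFAULT_HEALTH_TIME_REQ, 60):
        timer = SeaHealthTimer(health_time_req=setting)
        print(setting, timer.may_rejoin(300), timer.seconds_remaining(300))
```

Under this model, an adapter that has been healthy for five minutes is still locked out with the default setting but would already be eligible with a value of 60, which matches what we saw when we lowered the attribute.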

It appears IBM is aware of this issue and working on a fix. Chris Gibson recently relayed this information:

APAR status
Closed as program error.

Problem summary
Given a pair of VIOS LPARs (2.2.5.x and up) with matching SEAs in HA mode (ha_mode set to auto or sharing) with one node in UNHEALTHY state, if the healthy node is rebooted or loses link, the UNHEALTHY node will not assume the PRIMARY state. In the field, a customer reboots the primary LPAR and waits until it is back up. Then the customer reboots the backup LPAR. Unbeknownst to the customer, the primary LPAR has gone into the UNHEALTHY state because the link came up slightly delayed.

When the backup LPAR is shutdown, the primary LPAR does not take over and become PRIMARY as it did before the upgrade.

Problem conclusion
Code changed to disable link check as part of health check and also reduce the default value of health_check attribute to 60 secs and minimum value to 1s.
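The scenario in the problem summary suggests a simple guard for failover testing: before taking the second VIOS down, confirm that no SEA on the first one is still UNHEALTHY. The Python sketch below shows only that decision. The function name, the adapter names and the idea of passing states in as a dictionary are assumptions for illustration; how you collect each adapter's state from the VIOS is deliberately left out.

```python
from typing import Dict, List


def seas_blocking_failover(sea_states: Dict[str, str]) -> List[str]:
    """Return the SEAs on the surviving VIOS that are still UNHEALTHY.

    sea_states maps a (hypothetical) adapter name to the state string you
    observed for it. An empty result means that, as far as this check can
    tell, the partner VIOS can be failed.
    """
    return sorted(
        name
        for name, state in sea_states.items()
        if state.strip().upper() == "UNHEALTHY"
    )


if __name__ == "__main__":
    # Hypothetical adapter names and states observed on VIOS1 after it rebooted.
    observed = {"ent8": "UNHEALTHY", "ent9": "PRIMARY"}
    blocking = seas_blocking_failover(observed)
    if blocking:
        print("Do not fail VIOS2 yet; still unhealthy:", ", ".join(blocking))
    else:
        print("No unhealthy SEAs reported on VIOS1.")
```

A check like this would not have fixed the defect, but it would have flagged the unhealthy adapter before the second failover took the client networks down.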

This is another reason to do plenty of testing after updates. In our case we just went from 2.2.4.22 to 2.2.5.10, yet we were bitten by this issue. For anyone doing VIO maintenance, it’s certainly something to be aware of.

Have you seen this type of behavior?