Article Misses the Point on VIOS Use

Edit: Hopefully you are running dual VIOS

Originally posted February 7, 2017 on AIXchange

This was posted on Jan. 17, but it’s worth revisiting. I thought the article was a little over the top, starting with the headline:

“Power Systems running IBM’s VIOS virtualisation need a patch and reboot
Unless you’re willing to tolerate the chance of data corruption”

Here’s what follows:

“IBM on Saturday slipped out news of a nasty bug in its VIOS, its Virtual I/O Server that offers virtualisation services on Power Systems under AIX.

Issue IV91339 strikes when moving virtual machines and means “there is a very small timing window where the VIOS may report to the client LPAR that some I/Os have completed before they actually do.”

IBM advises that “This could cause applications running on the client [logical partition] LPAR to read the wrong data from the virtual device. It’s also possible that data written by the client LPAR to the virtual device may be written incorrectly.

Hence the issue’s title: “possible data corruption after LPM failure.”

Of course data corruption is precisely what Power Systems and AIX are supposed not to do. The platforms are promoted as exceptionally stable and resilient, just the ticket for mission critical applications that can’t afford many maintenance windows, never mind unplanned ones.

So IBM’s guidance that “Installation of the ifix requires a reboot” will not go down well with users.” 

After the article went live, it was updated:

UPDATE: IBM’s now released a fix and updated its advice on this issue.

Big Blue now also says “The risk of hitting this exposure outside of the IBM test lab has had extensive evaluation and is considered extremely small. The controlled test environment where this problem was observed makes use of a high-precision test injection tool that was able to inject a specific error within a tiny window.”

“The chances of hitting this window outside of the IBM test lab are highly unlikely and there is no known occurrence of this issue outside of the IBM test lab.”

The Reg is nonetheless aware that IBM has recommended users implement the patch.

As I said, I thought this was over the top, and judging by these comments, I wasn’t the only one:

Uh… why not?

Patch and boot the secondary, then patch and boot the primary. Extra points if you are nice enough to disable the vSCSI and virtual FC adapters of the corresponding VIOS first (rmdev -pl $adaptername). Ethernet fails over automatically, though you could add extra grace there as well.

Hardly a big deal. And in order to run into IV91339's bug, you'd have to have a failing LPM in the first place.

                    ******************************

If this goes back as far as 2.2.3.X, then clearly it is not happening often, and management might decide that the higher risk to the business is updating and rebooting a dual VIOS configuration.

As far as change records go: whether they are a major pain, a minor pain or no pain, experience has taught many that no records is ultimately a "killing pain." This, again, is a process that ensures the business can manage its risk as it sees it. System administration is not the business; even when "we" have the best of intentions, "they" must okay the process. That is how business is done.

The argument that should be made is that these systems were engineered for concurrent maintenance. Not doing the maintenance now may lead to a disruptive "moment" later. The business does not need to know the technical details; it needs to know the relative risk and the impact on the business. The whole design, aka best practice, of using dual VIOS means the impact should be zero, even with a reboot!

                    ******************************

Although there are reasons to go with a single VIOS, and more recent features provide cluster-like availability on other servers, my preference within my organization is to deploy dual VIOS. It's a nominal expense to deploy, and it gives us the ability to tell the business that the platform will continue to service the dozens of VMs on each box while we do concurrent maintenance on each VIOS.

We are not shy with our stakeholders, either, about how we've built our Power environment (starting with P4 and now mostly P8), so they have confidence in the platform and in our ability to keep it all running virtually non-stop.

Really, the article’s whole premise is faulty. I can’t recall the last time I saw an environment with VIOS that wasn’t using dual VIO servers. Patching one VIOS, rebooting and then patching the other VIOS is business as usual. Updating VIOS with the client LPARs running is common practice, and isn’t much of a risk in my opinion. During your next patch cycle, add the fix as you always would. This platform is exceptionally stable and resilient, and this article and the comments actually illustrate that point.
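For anyone who hasn't done a rolling VIOS update, the sequence the commenters describe looks roughly like this. Treat it as shell-style pseudocode, not a runbook: the adapter names and fix-media path are placeholders, the rmdev syntax shown is the AIX form the commenter used (on the VIOS itself you'd run it as root via oem_setup_env), and you should verify the exact updateios invocation against IBM's documentation for your level before running anything.

```
# Rolling update of a dual-VIOS pair -- sketch only, all names are placeholders.

# 1. On the standby VIOS, optionally quiesce its virtual adapters so
#    client LPARs fail over to the partner cleanly (the "extra grace"):
#      rmdev -pl vhost0        # vSCSI server adapter (AIX syntax, as root)
#      rmdev -pl vfchost0      # virtual FC (NPIV) server adapter
#    SEA failover moves the Ethernet traffic to the partner VIOS automatically.

# 2. Apply the fix and reboot that VIOS (as padmin):
#      updateios -install -accept -dev /home/padmin/fixes
#      shutdown -restart

# 3. Once it is back, confirm the clients see their paths again,
#    then repeat steps 1-2 on the other VIOS.
```

With dual VIOS and MPIO/SEA failover in place, the client LPARs keep running through both reboots, which is exactly the commenters' point.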