Rob McNelly on ‘Lights-Out Data Center’ Issues, the Latest IBM Announcements and More

Originally published by TechChannel October 19, 2022

Have you heard of “lights-out data centers”? Rob McNelly explains what they are, along with their pitfalls, and explores the latest IBM announcements.

Though I’ve heard about lights-out data centers for years, I truly don’t envision a future where humans will never set foot on the raised floor. We’ll always need hands and eyes in the room to perform tasks on our systems.

Case in point: Recently I serviced a customer that had three of their four fibre network ports inactive on their network switch. For example, ent0 showed that we were disconnected:

                  entstat -d ent0
                  Link Status: Down
                  Media Speed Selected: Autonegotiation
                  Media Speed Running: Unknown

Meanwhile, ent1 was fine and showed that we were connected:

                  entstat -d ent1
                  Link Status: Up
                  Media Speed Selected: Autonegotiation
                  Media Speed Running: 1000 Mbps Full Duplex

The OS was not seeing the expected connection on the network ports. This was verified by the network team, who could also see from the switch side that the ports they expected to have connections were in fact not connected.
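If you have more than a couple of adapters to check, a small helper saves some scrolling. This is only a minimal sketch: the link_status function name is my own invention, and it assumes the "Link Status:" line looks like the output shown above. Some adapter types label the line differently, so adjust the pattern to match what your entstat actually prints.

```shell
# Hypothetical helper: print the value of the first "Link Status" line
# found in `entstat -d` output supplied on stdin.
# Assumes the line layout shown in the article ("Link Status: Down").
link_status() {
    sed -n 's/^[[:space:]]*Link Status[[:space:]]*:[[:space:]]*//p' | head -n 1
}

# Example use on an AIX host (adapter names are placeholders):
#   for a in ent0 ent1; do
#       printf '%s: %s\n' "$a" "$(entstat -d "$a" | link_status)"
#   done
```

Reading from stdin keeps the parsing separate from the entstat call, so you can also run it against saved output from a support ticket.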

During this call, we learned that this was an ongoing issue. The client initially tried replacing small form factor pluggables (SFPs) on the switch. They physically verified that the expected ports from the server were plugged into the correct ports on the switch.

We had the luxury of being able to swap cables, and lo and behold, the problem followed the cables. What was the working port prior to the swap ceased to function, and vice versa. Was it a bad cable? Nope. We tried a different cable and had the same issue.

At that point it was lights on, figuratively, in our heads, because we realized that the TX and RX polarity was reversed on the cable. So we asked the onsite team to correct the cable and plug it back into the switch. As expected, the port fired right up. All three of the non-working ports had this issue, so we did two more reversals of the TX and RX.

Working remotely, we could adjust the switch and logical configurations on the server all we wanted, but it wouldn’t have accomplished anything. To fix this problem, we needed people on site.

On that note, be sure to show your appreciation for the CEs and any data center personnel you work with. If you yourself are a “hands and eyes” person, then I thank you, too. Remember, without the professionals who work directly on these systems and associated equipment, none of us are doing much of anything.

First Impressions of IBM October 11 Announcements 

As IBM Champion Alan Fulton notes, there is indeed much to unpack with IBM’s October 11 announcements, starting with updates to PowerVM, vHMC, PowerVC, AIX and IBM i.

There’s plenty that caught my eye as well:

  • Support for AIX install and boot from iSCSI attached storage. Consult the IBM System Storage Interoperation Center (SSIC) for additional information on supported configurations.
  • Increased NFS file size limit beyond 32 TB. See the AIX 7.3 TL1 Release Notes for the new supported limits.
  • AIX tar command support for pax archive format. Previously, AIX tar supported only ustar archive format. The new pax format archive can be created using the “--format=pax” option in AIX tar command.
  • Improvements in AIX dump performance through hardware-accelerated compression on IBM POWER9 and Power10 systems.
  • JFS2 filesystem now allows dynamic switching between inline and outline logging.
  • The chpv command now provides an option to force offline a poorly performing PV in a mirrored pair.
  • Ability to perform VIOS updates using vHMC.

That’s just an abbreviated list. Read the announcement letter for yourself.
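For the tar item in the list above, here is a rough sketch of how the new option might be used. The make_pax_archive function name and its arguments are hypothetical, and I am assuming the option can be placed ahead of the usual -cf flags, as it can with GNU tar. Check the AIX 7.3 TL1 tar documentation for the exact syntax before relying on this.

```shell
# Hypothetical helper: create a pax-format archive of a directory.
#   $1 = archive file to create
#   $2 = directory to archive
# Assumption: --format=pax is accepted before the -c and -f flags.
make_pax_archive() {
    archive=$1
    dir=$2
    tar --format=pax -cf "$archive" "$dir"
}

# Example use (paths are placeholders):
#   make_pax_archive /tmp/scripts.tar ./scripts
#   tar -tf /tmp/scripts.tar
```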

Two More Tales From the Field

Another customer had a VIO server that was spitting out unexpected vfchost errors in the error log, so they opened a ticket. IBM Support pointed them to this information:

Problem: Qlogic or Cavium IBM fibre adapters in Power Systems register as targets in the SAN fabric instead of initiators.
Symptom: Any one of a number of symptoms might be present, including:

1. NPIV client LPARs do not discover devices during scans by system firmware (SMS or ioinfo)
2. AIX hosts can fail to boot
3. The AIX error log might be filled with extraneous errors when SAN monitoring software runs, or even when cfgmgr runs, as the adapter attempts to log in to itself. The errors decode as name server query failures. Detailed SENSE DATA indicates the failure was against the physical adapter’s own N_Port ID.

The doc also notes that, on a VIO server, its resolution steps must be performed from the oem_setup_env prompt, followed by a reboot of the VIO server. In our case, we followed the directions and the errors went away. So keep this in mind should you run across something similar.

And one final story: Yet another customer who uses SSH to access a server wanted to determine why the sessions would end when left open for some time. There are a few ways to deal with this problem, but start here, and scroll down for this response:

ssh -o TCPKeepAlive=yes -o ServerAliveCountMax=20 -o ServerAliveInterval=15 my-user-name@my-server-domain-name-here

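My customer tried that solution from their command line and it worked. Eventually they made the change to their /etc/ssh/sshd_config file so they no longer needed to enter those options on the command line. (Keep in mind that the ServerAlive options shown above are client-side settings that belong in ssh_config or ~/.ssh/config. The corresponding server-side settings in sshd_config are ClientAliveInterval and ClientAliveCountMax.)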
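If you would rather keep the client-side options out of your command history, one approach is a per-host stanza in your own ~/.ssh/config. The sketch below prints such a stanza and works out how long an unresponsive server is tolerated before the client gives up. The keepalive_stanza and keepalive_budget names are made up for illustration, and choosing ~/.ssh/config is my assumption, not a description of what this customer did.

```shell
# Hypothetical helper: print an ssh_config Host stanza carrying the
# keepalive options from the command line above.
#   $1 = host pattern, $2 = ServerAliveInterval, $3 = ServerAliveCountMax
keepalive_stanza() {
    printf 'Host %s\n' "$1"
    printf '    TCPKeepAlive yes\n'
    printf '    ServerAliveInterval %s\n' "$2"
    printf '    ServerAliveCountMax %s\n' "$3"
}

# Hypothetical helper: seconds of server silence before the client
# disconnects, which is roughly interval multiplied by count.
keepalive_budget() {
    echo $(( $1 * $2 ))
}

# Example use (appends to your personal client config):
#   keepalive_stanza my-server-domain-name-here 15 20 >> ~/.ssh/config
```

With the values from the command above, the client sends a keepalive every 15 seconds and gives up after 20 unanswered ones, so the budget works out to about 300 seconds.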

IBM Support Forums Have Moved 

As of October 11, the IBM Support forums are part of the IBM Community website:

“To improve your support experience and provide you with the best possible access to people who know and understand your products, the Support Forums join the IBM Community on October 11, 2022.

Simply visit the IBM Community website to search for and continue discussing your products there. The IBM Support site will provide a link to the IBM Community for some time after the move, but we recommend all users update bookmarks pointing to the Support site’s Forums as soon as possible. To make this transition as easy as possible, the Forums will remain on the Support site until November 11, but you will only be able to read questions and responses there, not post new ones.”

While I’m on the topic of IBM Support, be sure to check out the Complete Guide To Must Gather LPM Data Collection on PowerVC, VIO, AIX, Linux and IBM i.

The Latest From Nigel and Chris 

If you know AIX, you know Nigel Griffiths. And if you know Nigel, you know nmon is his baby. He recently sent out this information.

If you have thousands of nmon files, you can drown in the high volumes of data. You need to extract the key facts to allow planning your server consolidation, migrating to newer servers or Power Live Partition Mobility. These nsum shell scripts do the hard work to build a CSV file to import into a spreadsheet for further work.
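To give a feel for this kind of summarizing, here is a minimal sketch of my own. It is not Nigel's nsum code. It assumes each nmon file carries an "AAA,host,<name>" line and CPU_ALL sample lines of the form "CPU_ALL,T0001,<User%>,<Sys%>,...", and the nmon_cpu_summary function name is hypothetical.

```shell
# Hypothetical helper: read one nmon file on stdin and print a single
# CSV row: hostname, average User%, average Sys%.
# Assumes CPU_ALL samples look like "CPU_ALL,T0001,User%,Sys%,..."
# and that the header row does not have a T-number in field 2.
nmon_cpu_summary() {
    awk -F, '
        $1 == "AAA" && $2 == "host"          { host = $3 }
        $1 == "CPU_ALL" && $2 ~ /^T[0-9]+$/  { user += $3; sys += $4; n++ }
        END { if (n) printf "%s,%.1f,%.1f\n", host, user / n, sys / n }
    '
}

# Example use over a directory of files (file names are placeholders):
#   for f in *.nmon; do nmon_cpu_summary < "$f"; done > cpu_summary.csv
```

One row per file keeps the result small enough to open in a spreadsheet, even when there are thousands of input files.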

Also via Twitter, Chris Gibson points to this document on migrating workloads to Power9 and Power10 systems. And on his personal blog, he explains how to find the hardware uptime for Power Systems frames.

“We needed to find the hardware uptime for a particular POWER9 frame to determine how close we were to hitting this known POWER9 firmware bug.

We found we could calculate this by looking at the “Progress Indicator History” view in ASMI and looking at the date associated with RUNTIME/STANDBY and working out how many days had passed since the frame was powered up.”

In that post he links to a C program that calculates the uptime, so check it out.
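If you just want the day count once you have the power-up date from ASMI, the arithmetic can be sketched in plain shell. This is my own illustration, not the C program from Chris's post. The jdn and days_between function names are hypothetical, and dates are assumed to be in YYYY-MM-DD form.

```shell
# Hypothetical helper: Julian day number for a Gregorian YYYY-MM-DD date.
jdn() {
    y=${1%%-*}; rest=${1#*-}; m=${rest%%-*}; d=${rest#*-}
    m=${m#0}; d=${d#0}    # drop a leading zero so it is not read as octal
    a=$(( (14 - m) / 12 ))
    yy=$(( y + 4800 - a ))
    mm=$(( m + 12 * a - 3 ))
    echo $(( d + (153 * mm + 2) / 5 + 365 * yy + yy / 4 - yy / 100 + yy / 400 - 32045 ))
}

# Hypothetical helper: whole days from the first date to the second.
days_between() {
    echo $(( $(jdn "$2") - $(jdn "$1") ))
}

# Example with made-up dates: a frame powered up on 2022-01-01 and
# checked on 2022-10-19 has been up for 291 days.
#   days_between 2022-01-01 2022-10-19
```

Doing the math with Julian day numbers avoids depending on date command options that differ between platforms.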