Edit: Change control is still key.
Originally posted March 25, 2014 on AIXchange
Recently, a customer was unable to run a DLPAR command against some of the LPARs on their frame. That in itself isn’t unusual. Generally in these situations the network isn’t communicating between the HMC and the LPAR, or perhaps RMC daemons need to be restarted somewhere.
This environment had dual HMCs connected to the managed system. HMC1 controlled some of the LPARs and HMC2 controlled others, but not by design. Although there was no rhyme or reason to it, for simplification let’s say that HMC1 was controlling LPAR1 and LPAR3 and HMC2 was controlling LPAR2 and LPAR4. The correct setup would have been HMC1 and HMC2 controlling LPAR1, LPAR2, LPAR3 and LPAR4. In reality approximately 40 LPARs were on the frame, with each HMC controlling approximately half of the LPARs.
If you were on HMC1, you could DLPAR LPAR1 and LPAR3 with no issues. If you were on HMC2, you could DLPAR LPAR2 and LPAR4 with no issues. The problem was that the only way to find out which HMC was controlling which LPAR was to either log in to the HMC command line and run lspartition -dlpar, or use the HMC GUI and select HMC Management > View Network Topology. There was no way to know in advance which HMC you needed to log in to in order to manage a given LPAR. This headache needed to be resolved.
Initially we did some troubleshooting with IBM Support. That resulted in us running things like:
/usr/sbin/rsct/install/bin/recfgct
/usr/sbin/rsct/bin/rmcctrl -p
We tried getting root access via pedbg. We also tried collecting a snap:
/usr/sbin/rsct/bin/ctsnap
Eventually, once we escalated high enough up the support food chain, someone noticed a very basic HMC setup problem:
The LPAR IBM.MgmtDomainRM default file shows this message where it’s attempting to create an IBM.MCP entry for hmc1. It fails with Error number 14, duplicate key for localhost.
2610-652 The specified time limit has been exceeded.
Mon Feb 17 13:04:32 CST 2014(439849) ../../../../../src/rsct/rm/MgmtDomainRM/MCP_cfg.c/01438/1.22 2613-034 Error number 14 was returned when attempting to define an IBM.MCP resource.
2610-014 The key token localhost is a duplicate.
During the initial build of the HMCs, each had been given a unique hostname and IP address, but at some point someone made a change that reset both hostnames to localhost. Since these HMCs had the same hostname and ran on the same network, only one of the two was capable of managing a given LPAR at any given time. The other one would always fail.
Needless to say, if you run multiple HMCs, make sure they have unique hostnames. And in any environment, it’s essential to establish good change control so people aren’t making changes to systems without proper approvals and documentation.
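A check like this is easy to script. What follows is only a minimal sketch, not something we ran in this environment: the function names find_duplicate_hostnames and hostnames_are_unique are made up for illustration, and the script assumes you have already collected each HMC's hostname by whatever means you prefer.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the HMC or RSCT tooling).

# Print any hostname that appears more than once in the argument list.
# Names are lowercased first because hostnames are case-insensitive.
find_duplicate_hostnames() {
    printf '%s\n' "$@" | tr '[:upper:]' '[:lower:]' | sort | uniq -d
}

# Return 0 if every hostname is unique, 1 (with a message on stderr) otherwise.
hostnames_are_unique() {
    dups=$(find_duplicate_hostnames "$@")
    if [ -n "$dups" ]; then
        echo "Duplicate HMC hostname(s): $dups" >&2
        return 1
    fi
    return 0
}
```

With the two HMCs from this story, calling hostnames_are_unique localhost localhost would report the duplicate and return a nonzero status. On the HMC command line, lshmc -n lists the network settings, including the hostname, so that output is one possible source for the values you pass in.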