Edit: Some links no longer work.
Originally posted August 2, 2011 on AIXchange
A customer recently called because they couldn’t log in to their machine. A new server was being built, and someone had rebooted the virtual machine. Once the system came back up, no one could ssh or telnet to it, though they were able to ping it across the network.
I was in a location that allowed me to set up a WebEx session, so we could both see what was going on rather than my simply hearing about it over the phone.
We started by running PuTTY and making an ssh connection to the HMC. From there, we ran vtmenu, chose a frame and selected the LPAR on that frame. We were able to open the console window, and we had a login prompt. However, we couldn’t log in as root. We tried a few different combinations of user IDs and passwords, but no luck. The machine appeared responsive, though. Had someone changed the passwords?
The decision was made to reboot the machine and bring it up in maintenance mode. This way we could change the root password, log in and verify the network communications.
Because this environment wasn’t virtualized, it wasn’t as easy as simply booting from a virtual optical disk. We also discovered that the NIM server lived on this non-booting LPAR, so booting from NIM to get into maintenance mode wasn’t going to work.
Luckily the disk controller that the CD was attached to was available, so we made the controller and the CD available to this LPAR and had someone load the physical AIX DVD into the drive. We booted the LPAR into SMS mode and then selected the correct CD device to boot the machine. Instead of choosing to install AIX, we started maintenance mode for system recovery. Then we chose to access a root volume group and start a shell.
Now we were logged in as root, and we were able to poke around. The filesystems looked OK after running a df, but when we tried to run the passwd command, we got an error. Everything pointed to a corrupt /etc/passwd file, but when we attempted to look at that file, we found that it didn’t exist. Someone had accidentally wiped it out. However, because /etc/security/passwd still existed, the users’ passwords were still there, and we just needed to get a copy of /etc/passwd back onto the system. Once we did so and rebooted the machine, it came right up and we could log in.
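For readers who want to sanity-check this kind of mismatch, here is a minimal Python sketch that compares the two files. It is not what we ran on the customer's system. It assumes the usual AIX stanza layout in /etc/security/passwd (a user name followed by a colon in column one, indented attribute lines below it, and comment lines starting with an asterisk) and the colon-separated layout of /etc/passwd. The function names are made up for this illustration.

```python
def stanza_names(security_text):
    """Return user names taken from stanza headers such as 'root:'."""
    names = []
    for line in security_text.splitlines():
        # Assumption: headers start in column one and end with a colon,
        # attribute lines are indented, and '*' starts a comment line.
        if not line or line[0].isspace() or line.startswith("*"):
            continue
        header = line.rstrip()
        if header.endswith(":"):
            names.append(header[:-1])
    return names


def passwd_names(passwd_text):
    """Return the set of user names found in /etc/passwd style text."""
    names = set()
    for line in passwd_text.splitlines():
        if line.strip() and not line.startswith("#"):
            # The user name is the first colon-separated field.
            names.add(line.split(":", 1)[0])
    return names


def missing_accounts(security_text, passwd_text):
    """List users with a password stanza but no /etc/passwd entry."""
    known = passwd_names(passwd_text)
    return [name for name in stanza_names(security_text) if name not in known]
```

You would read both files as root and pass their contents to missing_accounts(). With an empty or missing /etc/passwd, every user that has a stanza is reported, which matches what we saw: the passwords were intact, but the accounts they belonged to were gone.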
We did see a few rm -rf commands in .sh_history, but we didn’t find the actual smoking gun to prove that the file was deleted. We did learn, though, that someone was copying /etc/passwd files around the environment, so it was certainly possible that this person erred when manipulating the files.
So how is your environment set up? Are you taking mksysbs? Are you backing up individual files so that you can recover them if needed? Do you have a NIM server available to boot and restore from? Do you have install media handy that you can boot from? Install media was the key in this case. Although my customer’s problem was fairly trivial and relatively easy to fix, having the install media on hand allowed us to resolve the issue quickly.
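As one illustration of backing up individual files, here is a small Python sketch that keeps timestamped copies of a file such as /etc/passwd. Treat it as a starting point under assumptions and not as a recommended tool: the function name and the backup directory are hypothetical, and it does not replace mksysb images or a NIM server.

```python
import shutil
import time
from pathlib import Path


def backup_file(source, backup_dir):
    """Copy source into backup_dir under a timestamped name; return the new path."""
    src = Path(source)
    dest_dir = Path(backup_dir)
    # Create the backup directory if it does not exist yet.
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"{src.name}.{stamp}"
    # copy2 keeps the permission bits and timestamps along with the contents.
    shutil.copy2(src, dest)
    return dest
```

A call such as backup_file("/etc/passwd", "/backup/etc"), where /backup/etc is a placeholder path, could run from cron. If you copy /etc/security/passwd the same way, keep the backup directory readable by root only, because that file holds the password hashes.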