Network Troubleshooting

Edit: It has been a while since I needed to mess with SSA disks.

Originally posted September 2005 by IBM Systems Magazine

Recently, a user opened a problem ticket reporting that copying files back and forth from a server we support was taking an unusually long time. The files weren’t all that large, but the throughput was just terrible. After poking around a bit, we found that the Ethernet card wasn’t set to the correct speed. When we ran lsattr -El ent0, we found the media_speed set to Auto_Negotiation. I knew what the problem was immediately.

We’ve found the Auto_Negotiation setting on Ethernet adapters to be problematic on AIX. Our Fast Ethernet port on the switch was always set to 100/Full. With Auto_Negotiation on, the card would sometimes correctly set itself to 100/Full, but at other times it would fall back to 100/Half. That duplex mismatch is what slows things down: the half-duplex side starts seeing collisions, which show up in the netstat -v output:

Packets with Transmit collisions:

 1 collisions: 204076      6 collisions: 37         11 collisions: 1
 2 collisions: 65375       7 collisions: 6          12 collisions: 0
 3 collisions: 16894       8 collisions: 2          13 collisions: 0
 4 collisions: 2404        9 collisions: 0          14 collisions: 0
 5 collisions: 255        10 collisions: 2          15 collisions: 0
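If you’d rather watch one number than fifteen buckets, a small filter can total the table. This is only a sketch: sum_collisions is a hypothetical helper name, and it assumes the buckets keep the "N collisions: count" layout shown above.

```shell
# sum_collisions: hypothetical helper, not an AIX command.
# Reads netstat -v style text on stdin and prints the sum of every
# "<n> collisions: <count>" bucket it finds.
sum_collisions() {
    awk '
        {
            # A bucket is three consecutive fields: number, "collisions:", count
            for (i = 1; i <= NF - 2; i++)
                if ($(i + 1) == "collisions:" && $i ~ /^[0-9]+$/)
                    total += $(i + 2)
        }
        END { print total + 0 }
    '
}
```

Run it as netstat -v | sum_collisions, then run it again a little later to see whether the total is still climbing. On a machine with several adapters, netstat -v reports all of them, so the total covers all of them too.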

netstat -v also tells you whether you’re getting Receive Errors and what speed the adapter is actually running at. You’ll see something similar to the following:

RJ45 Port Link Status : up
Media Speed Selected: Auto negotiation
Media Speed Running: 100 Mbps Full Duplex

Transmit Statistics:                      Receive Statistics:
--------------------                      -------------------
Packets: 33608151                         Packets: 82280769
Bytes: 3364953629                         Bytes: 89992126877
Interrupts: 15105                         Interrupts: 79762362
Transmit Errors: 0                        Receive Errors: 14000
Packets Dropped: 1                        Packets Dropped: 14
                                          Bad Packets: 0
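The two Media Speed lines are the ones to compare, and that check is easy to script. The sketch below is an assumption-laden example rather than anything AIX ships: check_duplex is a made-up name, and it relies on the "Media Speed Selected:" and "Media Speed Running:" labels appearing exactly as in the output above.

```shell
# check_duplex: hypothetical helper, not an AIX command.
# Reads netstat -v style text on stdin, prints the selected and running
# media speed, and returns non-zero if the adapter is running half duplex.
check_duplex() {
    awk -F': *' '
        /Media Speed Selected/ { selected = $2 }
        /Media Speed Running/  { running = $2 }
        END {
            printf "selected=%s running=%s\n", selected, running
            if (running ~ /Half/) {
                print "WARNING: adapter is running half duplex"
                exit 1
            }
        }
    '
}
```

Used as netstat -v | check_duplex, it gives a one-line summary plus an exit status you can test in a monitoring script. If the input covers more than one adapter, only the last adapter’s values are reported.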

How did we fix the duplex issue? We detached the interface and ran chdev to force the adapter to 100/Full: chdev -l ent0 -a media_speed=100_Full_Duplex. Once we made this change, there were no more collisions and the user was a happy camper.
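Since this change takes the interface offline, it can help to print the commands and read them over before running anything. The following is a dry-run sketch only: plan_speed_change is a hypothetical helper that echoes text and executes nothing. It assumes the usual AIX pairing of adapter entN with interface enN, and it deliberately leaves out bringing the interface back up afterward, because that step depends on how your interface is configured.

```shell
# plan_speed_change: hypothetical dry-run helper. It only prints commands.
# Usage: plan_speed_change <adapter> [media_speed]
plan_speed_change() {
    adapter=$1                      # adapter name, e.g. ent0
    speed=${2:-100_Full_Duplex}     # media_speed value to set
    iface="en${adapter#ent}"        # assumes adapter entN maps to interface enN
    echo "ifconfig $iface detach"                       # take the interface down
    echo "chdev -l $adapter -a media_speed=$speed"      # force speed and duplex
    echo "lsattr -El $adapter -a media_speed"           # confirm the new setting
}
```

For example, plan_speed_change ent0 prints the detach, the chdev from above, and an lsattr check for ent0. Run the printed commands from the console rather than over the interface you’re about to detach.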

Verifying Failed SSA Disks

Another issue that crops up from time to time is a dead SSA disk. How do you know which physical disk in your drawer needs to be replaced? In some instances, once the disk has died, you can no longer go into Diag / Task Selection / SSA Service Aids / Link Verification to select the disk and identify it, because it’s no longer responding.

In this situation, you can use link verification to identify the SSA disks on either side of the failed disk. The disk sitting between the two blinking disks is the bad one. Another way to verify that you’ve selected the correct disk to replace is to run lsattr -El pdiskX, where “X” is your failing pdisk number. The connwhere_shad value in the output contains the serial number, which you can match against the serial number printed on the disk. (Note: It won’t be an exact match. Drop the trailing 00D and compare characters 5-12 of the value with the printed serial number.) Here’s the output:

lsattr -El pdisk45

adapter_a       ssa3             Adapter connection                                   False
adapter_b       none             Adapter connection                                   False
connwhere_shad  006094FE94A100D  SSA Connection Location                              False
enclosure       00000004AC14CB52 Identifier of enclosure containing the Physical Disk False
location                         Location Label                                       True
primary_adapter adapter_a        Primary adapter                                      True
size_in_mb      36400            Size in Megabytes
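Counting characters by eye is easy to get wrong, so here is a small sketch that pulls the serial number out for you. ssa_serial is a hypothetical helper name, and the code assumes the serial number really is characters 5 through 12 of the connwhere_shad value, as described above.

```shell
# ssa_serial: hypothetical helper, not an AIX command.
# Reads lsattr -El pdiskX output on stdin and prints characters 5-12 of
# the connwhere_shad value, which should match the label on the disk.
ssa_serial() {
    awk '$1 == "connwhere_shad" { print substr($2, 5, 8) }'
}
```

Running lsattr -El pdisk45 | ssa_serial against the output above prints 94FE94A1, which is what you would look for on the drive label.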

Another way to find your disk, based on its location code, is lsdev -C | grep pdiskX. To replace it, run rmdev -dl pdiskX, swap in your replacement disk and run cfgmgr.

If your SSA disk was part of a RAID array, hopefully at this point your hot spare took over, and you can just make your replacement disk the new hot spare. To do that, go to Diag / Task Selection / SSA Service Aids / SMIT - SSA RAID Arrays / Change/Show Use of an SSA Physical Disk, and change your newly replaced disk from a system disk to a hot spare disk. To verify all is well, I like to go into smitty / Devices / SSA RAID Arrays / List Status of Hot Spare Protection for an SSA RAID Array. It should report that the RAID array is protected and the status is good. Keep in mind that List Status of Hot Spare Protection only works with the latest SSA adapter (4-P); older cards such as the 4-N don’t have this feature.