Monitor-ing the Situation

Edit: I did get that USB monitor. And I added a few more to my desktop for good measure.

Originally posted June 21, 2016 on AIXchange

Over the years I’ve discovered that you can never have too many monitors connected to your system. I’m reminded of this whenever I go on the road with a laptop and single screen.

One of these days — even though it will mean adding still more weight to my bag — I’ll break down and get a USB monitor for my laptop:

“If you want the screen space of a traditional monitor mated with the kind of portability you can slip into your laptop’s carrying case, there’s a whole sub-class of monitors designed just for you. These products exist in a sort of limbo between full-size monitors and tablet screens in terms of screen size, resolution, and contrast.”

I find a minimum of two monitors helps me multitask. I can be using one screen that’s logged into a system, while my other screen can be reserved for documentation, or for reading one thing while working on another.

I consider 3-5 monitors a pretty good sweet spot, though someday — someday — I hope to procure a wall of monitors like these.

There are other multi-monitor advocates out there. This article notes the productivity benefits of dual-monitor usage. This PCWorld piece gets into some of the other benefits.

“Having multiple monitors (and I’m talking three, four, five, or even six) is just…awesome, and something you totally need in your life.

Right now, my main PC has a triple-monitor setup: my main 27-inch central monitor and my two 24-inch side monitors. I use my extra monitors for a number of things, such as comparing spreadsheets side-by-side, writing articles while also doing research, keeping tabs on my social media feeds…

A vertically-oriented monitor can save you a lot of scrolling trouble in long documents. If you’re a gamer, well, I don’t need to sell you on how great three-plus monitors can be for games that support multi-monitor setups. You just need to plan ahead. Here’s our full guide on setting up multiple monitors—and all the factors you’ll need to take into account before you do so.”

Although that article focuses on using a graphics card with all of your monitors connected to the same system, you can also control multiple systems and monitors with software like Synergy:

“Synergy combines your desktop devices together into one cohesive experience. It’s software for sharing your mouse and keyboard between multiple computers on your desk. It works on Windows, Mac OS X and Linux.”

What is your ideal setup? Are you OK with just one monitor and lots of windows, or do you prefer lots of windows across lots of monitors?

The AIX Expansion Pack

Edit: How often do you use these packages? Some links no longer work.

Originally posted June 14, 2016 on AIXchange

Are you familiar with the AIX Expansion Pack?

“The AIX Expansion Pack is a collection of extra software that extends the base operating system capabilities. The AIX Web Download Pack is a collection of additional applications and tools available for download from the Web. All complement the AIX operating system with the benefit of additional packaged software at no additional charge.”

By selecting the download link from the right side of that page and signing in with your IBM ID, you’ll find a list of different packages available for download, including openssh, openssl, perl, samba, rsyslog and lsof. (Note: These may not be the most current versions of software, so you could run into code issues. Perzl.org may have more up-to-date versions of the software you’re looking for.)
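If you want to see which of these packages (and which versions) are already installed before you download anything, a quick lslpp query will tell you. A minimal sketch; adjust the fileset names to whatever you’re checking for:

    # List installed filesets that match common Expansion Pack packages
    lslpp -L | egrep -i "openssh|openssl|samba|rsyslog|lsof"

    # Or check a single fileset's level directly, e.g.:
    lslpp -L "openssh.base*"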

If you’re wondering whether IBM supports specific programs from the AIX Expansion Pack, this pretty handy table can help you determine whether you can open a PMR. Some entries are marked as having PMR support, some have critical support only as part of particular products, and others are unsupported.

If you use open-source software in your AIX environment and you’d like IBM to continue to host and maintain offerings like the AIX Expansion Pack, it wouldn’t hurt to let them know about it. The sentiment expressed in this old post still applies:

“I would also recommend telling your local IBM representative that you think this needs to be fixed. Customer pressure is a good incentive for IBM to get organized, sort this out and eventually get it working.”

Don’t Forget About Server Consolidation

Edit: I want my enterprise class server.

Originally posted June 7, 2016 on AIXchange

You likely know that we can run multiple operating systems on Power servers. With powerful POWER8 servers, we can consolidate workloads such as AIX, IBM i, and Linux and run them simultaneously on the same server — assuming it’s not one of the newer L or LC Linux-only models.

But about those L and LC boxes: They’ve come up a lot in my recent conversations with customers. While IBM is quick to remind customers that it’s still heavily invested in AIX and IBM i and that those platforms aren’t going away, it’s also up front with its message about Linux and POWER8 servers: It’s a powerful combination.

When customers are interested in going head to head with x86 servers and competing on cost, the Linux-only L and LC models running PowerKVM virtualization make for an easy case. You’ll get better performance at a lower price. IBM has also made it convenient for new Power customers to run PowerKVM, since you don’t need an HMC to manage your systems. Understandably, an enterprise that has never used an HMC may not want to invest the time to learn about HMCs and VIO servers.

It’s great to see the interest in these offerings. However, I often end up reminding my customers that an existing IBM solution, the PowerVM hypervisor, might actually be a better option for running their Linux workloads.

Linux workloads can run on smaller scale-out servers, but they can also run on larger systems. This is where PowerVM fits in. It handles Linux workloads, even if you’re not running AIX or IBM i on your frame. 

PowerVM is a mature virtualization offering that’s been running mission-critical workloads for years. Think about it: When was the last time you had an issue with PowerVM? Compared with PowerKVM, PowerVM offers better guaranteed quality of service and lower virtualization overhead (because the hypervisor lives in firmware rather than running as QEMU). With multiple VIO servers, you get higher availability for your systems and the ability to perform maintenance on those redundant VIO servers. Because a firmware-based hypervisor presents a smaller attack surface, you also get better VM segregation and better security. Finally, PowerVM supports shared processor pools, which can reduce licensing costs while guaranteeing a certain amount of resources to a group of workloads.

PowerVM offers other advantages. You can choose to set up your LPARs with shared dedicated processors. When defining LPARs, you can guarantee a minimum entitlement for your LPAR and you can hard-cap your virtual machines. Assuming you’re running on higher-end hardware, you’ll be able to use capacity on demand and dynamically change more of the settings on your LPARs compared to what you can do with PowerKVM.

As I said, it’s great that IBM has an option in PowerKVM that competes with x86 systems on cost and performance. But here’s the thing many customers forget: Replacing 20 x86 machines with 20 Power L or LC models isn’t the only option. You may find it more beneficial to consolidate those 20 x86 servers into a small number of beefier Power servers running PowerVM. Your data center cabling, power and cooling requirements will all go down, while your average server utilization will go up.

Sure, you could alternatively replace those 20 x86 machines with a smaller number of Linux-only machines. In doing so, you’ll get better performance per core with Power. But with larger enterprise servers, you can have a far greater number of cores and much more available memory to work with when compared to any of the scale-out models.

Even as IBM continues to update and advance its Linux story, there’s still much to be said for consolidating workloads through PowerVM. These servers remain well worth considering.

Upgrading SDDPCM Drivers

Edit: I still love getting scripts from readers

Originally posted May 31, 2016 on AIXchange

In January I posted some scripts I’d received from Simon Taylor. He’s since provided me with more:

“Hi Rob,
Annual upgrades are happening again. We have the common problems with getting downtime, etc., and I wasn’t over keen on the published methods of upgrading sddpcm device drivers. Fortunately, I came across a post by Josh-Daniel S. Davis on replacing the pre-deinstallation script (which fails if there are any active disks) with one that just exits 0.

Here’s how it works:

I’ve added a post installation (-z) script for nimadm alt_disk_migration. The alt_disk migration takes place in a chroot environment and I expected that there would be no real access to disk device drivers from within the chroot. This seems to be true, and my environment migrated successfully from AIX 6 and devices.sddpcm.61.rte 2.6.0.3 to AIX 7 and devices.sddpcm.71.rte 2.6.7.0.

I bundled the sddpcm and devices.fcp.disk.ibm.mpio filesets into the post installation script using uuencode because by the time the post installation script runs, the migration lpp_source has been unmounted. (There’s an install_all_updates script built into the migration that tries to upgrade all software in the lpp_source not already updated by the main upgrade logic. The install_all_updates fails on sddpcm.)

The script includes a bit of logic (lslpp -Lqc “devices.sddpcm*”) to find the current version and decides whether or not to upgrade. If an upgrade is necessary, the deinstallation script is found using “ls /usr/lpp/devices.sddpcm*/deinstl/*pre_d” and replaced with the exit 0 script. This has helped us towards our goal of one-click upgrades.”

Simon’s .tar file includes this information:

    The mk_alt_post_script tars up the contents of the tar subdirectory
    and uuencodes them into a script called post_alt_mig_script which is
    called by the nimadm command. The attached tar file contains:

    ./alt_disk/
    ./alt_disk/tar/                      # add the lpps here and run inutoc
    ./alt_disk/tar/readme
    ./alt_disk/tar/upgrade_sddpcm.ksh    # removes old sddpcm and installs new
    ./alt_disk/mk_alt_post_script        # builds post_alt_mig_script
    ./alt_disk/readme
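To make the approach concrete, here’s a minimal ksh sketch of the version check and exit-0 swap that Simon describes. This is my own illustration, not his actual script; the target level and the surrounding deinstall/install steps are assumptions:

    #!/usr/bin/ksh
    # Decide whether an sddpcm upgrade is needed by checking the installed level.
    current=$(lslpp -Lqc "devices.sddpcm*" | awk -F: '{ print $3; exit }')
    target="2.6.7.0"    # assumed target level

    if [ "$current" != "$target" ]; then
        # Find the pre-deinstallation script (which fails when disks are active)
        # and replace it with one that simply exits 0, as Simon describes.
        for pre_d in $(ls /usr/lpp/devices.sddpcm*/deinstl/*pre_d 2>/dev/null); do
            print "#!/usr/bin/ksh\nexit 0" > "$pre_d"
            chmod +x "$pre_d"
        done
        # ...deinstall the old driver and install the new one here...
    fi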

What do you think? Is this something you’d find useful in your environments? If you have similar scripts or ideas that can be shared, please contact me.

Finding the Motivation to Change

Edit: I am still more active than I once was, and I have kept the weight off.

Originally posted May 24, 2016 on AIXchange

This blog typically covers AIX and other technical topics. However, every now and again I write about something else that interests me. This week’s topic, honestly, is sensitive.

You’re overweight. Or, if you’re not, you likely know someone who is. The Centers for Disease Control and Prevention estimates that at least one-third of Americans are obese:

“Obesity is common, serious and costly.

More than one-third (34.9% or 78.6 million) of U.S. adults are obese. Obesity-related conditions include heart disease, stroke, type 2 diabetes and certain types of cancer, some of the leading causes of preventable death.

The estimated annual medical cost of obesity in the U.S. was $147 billion in 2008 U.S. dollars; the medical costs for people who are obese were $1,429 higher than those of normal weight.”

This isn’t necessarily a comment on the IT industry, but obviously our work makes it easy to fall into a sedentary lifestyle:

“How many of us IT professionals are putting on a few pounds? We do generally have relatively sedentary lifestyles. We drive to our jobs, and sit in front of a computer all day. And if we’re not doing that, we’re sitting in a meeting. Then we go home and play video games and/or watch TV and movies. We eat more fast food than fruits and vegetables. Over time, this lifestyle takes its toll.

Starting healthy new habits like eating better and exercising more can be tough. It can be harder still to maintain these habits. I would argue that some in the IT industry — myself included — should think about getting into the habit in the first place.”

When I wrote that back in 2009, I was talking to you — and, as noted, literally describing myself. I was eating junk and putting on pounds. But more recently, things have changed dramatically for me.

I’ll be honest: the logical arguments I made back then did nothing to alter my own behaviors. What happened was my sons got involved in Boy Scouts. I wanted to support them. To become an adult leader, you’re required to get a physical and fill out some paperwork. Basically, you need to demonstrate that you’re fit enough to participate in the week-long summer camps and backpacking trips with troops. One of the BSA forms mentions BMI limits:

“Excessive body weight increases risk for numerous health problems. To ensure the best experience, Scouts and Scouters should be of proportional height and weight. One such measure is the Body Mass Index (BMI), which can be calculated using a tool from the Centers for Disease Control here: http://www.cdc.gov/nccdphp/dnpa/bmi/ . Calculators for both adults and youth are available. It is recommended that youth fall within the fifth and 85th percentiles. Those in the 85th to 95th percentiles are at risk and should work to achieve a higher level of fitness.”

My doctor took this information seriously, and told me that he wouldn’t sign my paperwork until my BMI was where it needed to be. That was my wake-up call. I finally took my weight seriously. I finally stopped stuffing my face.

You’ve heard it all before: diet and exercise. That’s all it is. As mathematically inclined people, we should be able to understand that to lose weight, we need to eat less than we burn. Skip the french fries and the hamburger buns and the soda. Mix in a salad. More protein, fewer carbs. Watch your portion sizes.

I’ve been going to a gym. I tried that previously, but I’d either lose motivation or get bored, mostly because I had no idea what I was doing. This time, I hired a trainer and attended classes. For me, it’s well worth the cost. Having someone to hold me accountable and vary my activities definitely helps.

Interesting thing: As much as working in technology can lead you to unhealthy lifestyles, there’s now a lot of cool tech stuff that can help you lose weight. There are apps that allow you to scan bar codes on food packages so you can more easily track your caloric intake. I have a scale that automatically connects to the web each time I get on it. It graphs my weight and tracks my BMI measurements. I have heart rate monitors that show how much effort I put into my exercise. I have fitness trackers that count the steps I walk. Obviously you don’t need the gadgets, but as a techie I enjoy them.

I’m much more active now: running, biking, swimming, hiking. I lived near mountains for a while. Now I climb them. One of this summer’s scouting activities is a trek to the bottom of the Grand Canyon. For the past three years I’ve participated in an event that purports to be the country’s largest Boy Scout triathlon. The first year I tried it, I was so out of shape I didn’t finish. The second year, I did finish, and last year I lowered my time by 12 minutes compared to the year before. Next time out, I expect to reduce my time again, hopefully by a similar margin.

The point is, since December 2012, I’ve lost more than 60 pounds. I’m still a work in progress, but I believe I’m on the right path.

I know it’s unlikely that my story will cause anyone to change, because I understand that I’m not telling you anything you don’t already know. Most of us engage in unhealthy behaviors. We smoke, we drink, we eat too much, we don’t exercise enough. We know about the health risks but for whatever reason, we don’t make meaningful changes. I personally know how it feels to lose weight, and then put it back on. And I know how easy it is to ignore what you see in the mirror.

But now, I also know how it feels to climb mountains without getting winded. I know how it feels to have my heart rate quickly return to normal after vigorous exercise. I know how it feels to go on lengthy hikes carrying a backpack that weighs more than the pounds I’ve lost. I know what it’s like to have to buy new clothes because nothing in the closet fits anymore. And I find all these things so personally gratifying. That’s why I’m sharing this with you.

If nothing else, if you see me at conferences eating junk, you can remind me of this piece. You can help hold me accountable. Or just maybe, someone will read these words and decide to actually make a change. If even one of you does, I’ll consider my efforts worthwhile.

Finding Lifecycle and Other Product Info

Edit: These charts project far into the future.

Originally posted May 17, 2016 on AIXchange

When is my version of AIX or PowerHA going out of support?

These types of questions come up all the time. The good news is there are multiple ways to find quick answers to them.

IBM has a support lifecycle webpage that provides this type of information about these and other products, like VIO server and PowerKVM. You can also learn product IDs, availability dates and all the different versions of various solutions.

“The IBM Software Support Lifecycle policy specifies the length of time support will be available for IBM software from when the product is available for purchase to the time the product is no longer supported. IBM software customers can use this site to track how long their version and release of a particular IBM software product will be supported. Using the information on this site, customers will be able to effectively plan their software investment, without any gaps in support.

Find detailed information about the available IBM Software Support Lifecycle Policies to help you realize the full value of your IBM software products.

Use the search form, or browse by software family or product name, to find the software lifecycle details you need. To stay up to date, subscribe to the lifecycle news feed, or download lifecycle data in XML format to import into your spreadsheet program or custom data processing application.”

Another option is to visit Fix Central and request to view fix packs for a particular AIX version. For example, if you browse to this page and scroll to the bottom, you’ll see a graphic showing the lifecycle for AIX version 7.2, accompanied by some useful verbiage discussing support plans. Additional graphics are available for other AIX versions to help you visualize where you are and when fixes were released.

If you have trouble displaying the graph, you can get there quickly via FLRT lite. Select one of the AIX versions and scroll to the bottom of the new page.

Finally, there’s this AIX support lifecycle chart.

Enhanced Support Options

Edit: Still the only way to go. Many of these links no longer work.

Originally posted May 10, 2016 on AIXchange

If you have IBM maintenance and support contracts on your IBM hardware and software, it’s a straightforward arrangement. When something breaks, you can open a PMR and get help.

But did you know that different levels of IBM support are available? Two options you might not know of are Enhanced Software Support and Custom Technical Support.

These options are considered upgrades from “standard” IBM support and might be worth looking into for your environment. I have customers that use these services and believe they receive substantial benefits for the extra cost. This stems from IBM being able to provide customized, proactive support as it gets to know each customer’s unique environment. I’ve seen IBM meet with customers’ IT staffs via conference calls and online meetings. IBM Support will prepare reports for reviewing open and closed PMRs and highlight available fixes that are applicable to their environments.

This datasheet has detailed information:

“But many others prefer to rely on outside services to supplement their in-house staff with the technical expertise they need — while still retaining full control and ownership of their IT infrastructures. And that’s where IBM Software Support Services — custom technical support comes in.

As a CTS client, you are assigned a technical solutions manager who can:

  • Act as an extension of your staff with the added advantage of IBM support
  • Facilitate appropriate service for you and keep your priority support team apprised of your needs
  • Offer custom problem-prevention assistance to help you make more effective maintenance decisions
  • Use IBM proprietary state-of-the-art analysis tools that can anticipate problems and work with you to help prevent them
  • Provide helpful information on new products, practices and technologies as appropriate.”

One of IBM’s analytical tools is called ProWeb. I recommend you watch this introductory video to learn about it. There is also a technical support appliance that’s designed to help you:

  • Streamline IT inventory management by intelligently discovering inventory and support-coverage information for IBM and non-IBM equipment.
  • Improve technical support management with analytics-based reports and collaborative services.
  • Mitigate costly IT outages via operating system and firmware recommendations for selected platforms.

Go here for details.

Did you already know about these IBM offerings?

What’s in Your Bag?

Edit: This was terrifying. Glad I avoided a watch list.

Originally posted May 3, 2016 on AIXchange

If you travel for your job as I do, you probably lug lots of gear. Chargers, cords and adapters are just some of the necessities that keep your gadgets in working order while you’re on the go.

If you spend any time on the raised floor, hopefully you have a cord that you can plug into a PDU, like these. From that cord I plug in a portable power strip, like these. This allows me to plug in all the gear I need during long stints in the computer room. I find power strips also come in handy when I’m sitting in airports and outlets are at a premium. You can be instantly popular by allowing others to plug into yours during a layover. That said, if you don’t carry a power strip, keeping your battery-powered items charged will usually suffice.

Speaking of batteries, I always bring extras for my laptop to keep it powered up for long flights or any extended time away from outlets. I also carry extra batteries for my noise-canceling headphones (which are great on planes or raised floors) and extra external battery packs for charging my cellphone.

All of this is a prelude to a story about the importance of keeping batteries separate from the rest of your gear.

I have Tom Bihn’s Snake Charmer, and am quite happy with it. Typically I’d just cram my cables and batteries and everything else into it and not give it a second thought. Then late last year, I was at the airport waiting to head home when I detected the smell of burning wires or plastic. I wrote it off to the holiday lights and decorations that were plugged in all over. Or maybe it was dust on a bulb or something. Once I got on the plane, the smell disappeared, so I didn’t give it a thought — that is, until I pulled my laptop out of my bag. The same burning smell returned. It was coming from my gear.

It was coming from the Snake Charmer. I had three external batteries inside of it that I use to recharge my cell phone. Somehow one of the prongs from a power adapter had jammed into the USB port on the battery pack:

“Battery pack manufacturers incorporate safety devices into the pack designs to protect the battery from out of tolerance operating conditions and where possible from abuse. While they try to make the battery foolproof, it has often been explained how difficult this is because fools can be so ingenious.

Subjecting a battery to abuse or conditions for which it was never designed can result in uncontrolled and dangerous failure of the battery. This may include explosion, fire and the emission of toxic fumes.”

Here’s more to keep in mind about carrying extra batteries:

“Any kind of conductive material bridging the external terminals of a battery will result in a short circuit. Depending on the battery system, a short circuit may have serious consequences, e.g., a rising electrolyte temperature or a buildup of internal gas pressure. If the internal gas pressure exceeds what the cell cap can endure, the electrolyte will leak, which will greatly damage the battery. If the safety vent fails to respond, an explosion can even occur. Therefore, don’t short circuit.”

So here I am with a battery that has a melting USB slot with a piece of metal jammed and fused into it. It’s hot, it stinks of burning plastic and metal, and it’s on a plane. Is this contraption going to explode or catch fire? And what will become of me? Will the plane be diverted? Will I be kicked off the flight?

These thoughts raced through my mind. But then, fortunately for me, I had a MacGyver moment. I realized I could unscrew the back piece of the charger, which exposed two wires and a small circuit board connected to the battery itself. I just detached the wires from the battery, it immediately stopped smelling, and the battery started to cool down. Crisis averted.

The amazing part was that no one said a thing. A guy seated nearby was watching me, and two flight attendants came by, but none of them questioned me. Everyone acted like it was perfectly normal for a guy to have a smoking, stinking electronic device with a circuit board and wires coming off of it on an airplane. Thankfully, the rest of the flight was uneventful.

I learned my lesson though. Now I make sure to segregate my portable batteries from the rest of my plugs and chargers. I still have no idea how that power adapter managed to find that battery’s USB slot, but now I realize that such an occurrence is a possibility.

The point is, check your bags. You may not haul as many batteries as I do, but if you attended the same conference I did last year, you may have the same type of charger. Learn from my mistake.

LPM and Firmware Compatibility

Edit: Check your firmware!

Originally posted April 26, 2016 on AIXchange

Here’s something of interest to those who use live partition mobility (LPM): IBM has created a matrix that shows firmware compatibility for conducting LPM operations between systems:

“Ensure that the firmware levels on the source and destination servers are compatible before upgrading.

In [Table 1], you can see that the values in the first column represent the firmware level you are migrating from, and the values in the top row represent the firmware level you are migrating to. The table lists each combination of firmware levels that supports migration.”

Below that first chart is a list that “shows the number of concurrent migrations that are supported per system. The corresponding minimum levels of firmware, Hardware Management Console (HMC), and [VIO servers] that are required are also shown.”

Then there’s a list of restrictions, followed by a table that shows the firmware levels and POWER models that support partition mobility:

“Restrictions:
• Firmware levels 7.2 and 7.3 are restricted to eight concurrent migrations.
• Certain applications such as clustered applications, high availability solutions, and similar applications have heartbeat timers, also referred to as Dead Man Switch (DMS) for node, network, and storage subsystems. If you are migrating these types of applications, you must not use the concurrent migration option as it increases the likelihood of a timeout. This is especially true on 1 GB network connections.
• You must not perform more than four concurrent migrations on a 1 GB network connection. With VIOS Version 2.2.2.0 or later, and a network connection that supports 10 GB or higher, you can run a maximum of eight concurrent migrations.
• From VIOS Version 2.2.2.0, or later, you must have more than one pair of VIOS partitions to support more than eight concurrent mobility operations.
• Systems that are managed by the Integrated Virtualization Manager (IVM) support up to 8 concurrent migrations.
• The Suspend/Resume feature for logical partitions is supported on POWER8 processor-based servers when the firmware is at level 8.4.0, or later. To support the migration of up to 16 active or suspended mobile partitions from the source server to a single or multiple destination servers, the source server must have at least two VIOS partitions that are configured as mover service partitions. Each mover service partition must support up to 8 concurrent partition migration operations. If all 16 partitions are to be migrated to the same destination server, then the destination server must have at least two mover service partitions configured, and each mover service partition must support up to 8 concurrent partition migration operations.
• When the configuration of the mover service partition on the source or destination server does not support 8 concurrent migrations, any migration operation that is started by using either the graphical user interface or the command line will fail when no concurrent mover service partition migration resource is available. You must then use the migrlpar command from the command line with the -p parameter to specify a comma-separated list of logical partition names, or the --id parameter to specify a comma-separated list of logical partition IDs.
• You can migrate a group of logical partitions by using the migrlpar command from the command line. To perform the migration operations, you must use the -p parameter to specify a comma-separated list of logical partition names, or the --id parameter to specify a comma-separated list of logical partition IDs.
• You can run up to four concurrent Suspend/Resume operations.
• You cannot perform Live Partition Mobility that is both bidirectional and concurrent. For example, [when] you are moving a mobile partition from the source server to the destination server, you cannot migrate another mobile partition from the destination server to the source server.”
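For reference, the migrlpar usage mentioned in those restrictions looks roughly like this when run from the HMC command line (the system and partition names here are placeholders):

    # Validate, then migrate, a group of partitions by name:
    migrlpar -o v -m SOURCE_SYS -t TARGET_SYS -p lpar1,lpar2
    migrlpar -o m -m SOURCE_SYS -t TARGET_SYS -p lpar1,lpar2

    # Or identify the partitions by ID instead of name:
    migrlpar -o m -m SOURCE_SYS -t TARGET_SYS --id 3,4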

Note that if you do not check your firmware versions, a firmware update can cause future planned LPM operations to fail. That’s all the more reason to add this link to your planning checklist.

Speaking of LPM, Chris Gibson takes note of a new HMC system-wide setting that allows LPM with inactive source storage VIO server.

“A new HMC & Firmware 840 feature allows LPM in a dual VIOS configuration when one VIOS is failed. Previously, LPM was not allowed if one of the source VIOS (in a dual VIOS configuration) was in a failed state. Both [VIO servers] had to be operational to perform LPM. The new support allows the HMC to cache adapter configuration from both [VIO servers]. Whenever changes are made to the configuration, the cached information will be updated on the HMC. If one VIOS is failed, instead of querying the failed VIOS, the HMC cache is used instead to create the new configuration on the target VIOS. This support was needed to cover the situation where there’s failed hardware which is causing an outage on the VIOS and requires a disruptive repair action. This new feature is enabled using a server wide HMC setting to enable the automatic caching of VIOS configuration details.”

Why Don’t We Have Root on the HMC?

Edit: I still want root.

Originally posted April 19, 2016 on AIXchange

For as long as there’s been an HMC, there have been frustrated administrators wishing they had root access to it.

The argument for root does contain a certain logic. The HMC runs Linux under the covers, so shouldn’t we, as UNIX admins, have fewer restrictions on what we’re able to do? We have root access (via oem_setup_env) on VIO servers and AIX and Linux LPARs, so why don’t we have root on the HMC? Of course, I’ve yet to meet a system admin who doesn’t believe he needs to have root on everything he touches. It’s our nature.

I recall some early versions of HMC code providing greater default access to the hscroot user. I’d certainly load things up and run them directly on the HMC. I’d play around with the window manager and load VNC and various software packages and generally do what I wanted since I had root access.

In retrospect, this probably wasn’t a great idea on my part. Having too many things running on the HMC makes it a support nightmare. If something isn’t working, is it because of the actual HMC code or hardware, or is the problem one of your pet tools or programs? If you’re IBM, locking down this critical piece of the Power Systems infrastructure and treating it like an appliance makes it much easier to support.

There are forum threads going back to at least 2005 where users share knowledge about getting root on the HMC. It’s tougher to find working information these days, but there are still methods for getting root that don’t involve IBM Support. Naturally people aren’t as willing to discuss them, because when these techniques do get out, they tend to be quickly invalidated.

Now, IBM Support does allow you to reset HMC access passwords. (Note: In the early days of this blog I wrote about getting the celogin password from support, but this isn’t the same as getting root.)

It’s also possible to get access to the product engineering shell (pesh) and get root if there’s a real need to do so. Honestly, after years of HMC enhancements and refinements, there aren’t many legitimate reasons for needing root at this point. Still, if you need to debug or perform other types of maintenance as root, you can contact IBM Support and follow these instructions:

“pesh provides full shell access to the HMC for product engineering and support personnel. pesh takes the serial number of the HMC machine or unique ID of the virtual HMC where full shell access is requested, then prompts the user for a one day password obtained from the support organization. If the password is valid, the user is granted full shell access. Only the hscpe user can run this command.

To obtain full shell access to a Hardware Management Console (HMC):
pesh serial-number-of-HMC-machine

To obtain full shell access to a virtual HMC:
pesh unique-ID-of-virtual-HMC”

The other thing to keep in mind is that root isn’t necessary for dealing with some common HMC management issues. Are your filesystems filling up? Try this. Are you dealing with some crazy hscroot syntax? Check out EZH, which makes the HMC command line easier to manage. (Here’s an introductory video.)

So do you want root on your HMC? Why or why not?

Coverage of IBM’s OpenPOWER Summit Announcements

Edit: Is POWER making inroads?

Originally posted April 12, 2016 on AIXchange

Last week I was in Austin for a Linux on Power workshop, when, as the kids say, my Twitter timeline blew up with news from the OpenPOWER Summit in San Jose.

Appropriately enough, as I started to write this, I saw tweets from Nigel Griffiths and David Spurway that referred to IBM’s “unusual” announcement.

I think part of what’s driving interest in this topic is that IBM typically keeps its cards close to the vest. The company seldom chooses to publicly reveal its future plans prior to announcements and general availability. Of course, many industry observers (myself included) have attended briefings where IBM tells you what’s ahead, but in those cases they’ve always made us sign NDAs. So, such public talk about POWER9 processors, which won’t be available until well into 2017, is indeed pretty surprising. Then consider Google’s involvement — they’ve never been forthcoming about their use of POWER — and you can see why this is such a big deal. Industry watchers, even those who primarily cover Microsoft or Apple, are realizing that Linux on Power solutions and POWER8 performance are worth paying attention to.

Anyway, for those of you who aren’t on Twitter, I’ll cite some of the articles covering the announcements relating to IBM POWER8 and POWER9 processors.

The Register:

“IBM’s POWER9 processor, due to arrive in the second half of next year, will have 24 cores, double that of today’s POWER8 chips, it emerged today.

Meanwhile, Google has gone public with its Power work – confirming it has ported many of its big-name web services to the architecture, and that rebuilding its stack for non-Intel gear is a simple switch flip.

The POWER9 will be a 14nm high-performance FinFET product fabbed by Global Foundries. It is directly attached to DDR4 RAM, talks PCIe gen-4 and NVLink 2.0 to peripherals and Nvidia GPUs, and can chuck data at accelerators at 25Gbps.

The POWER9 is due to arrive in 2017, and be the brains in the U.S. Department of Energy’s Summit and Sierra supercomputers.

Google says it has ported many of its big-name web services to run on Power systems; its toolchain has been updated to output code for x86, ARM or Power architectures with the flip of a configuration flag.

Google and Rackspace are working together on Power9 server blueprints for the Open Compute Project. These designs are compatible with the 48V Open Compute racks Google and Facebook are working on.

The blueprints can be given to hardware factories to turn out machines relatively cheaply, which is the point of the Open Compute Project: driving down costs and designing hardware to hyper-scale requirements. Rackspace will use the systems to run POWER9 workloads in its cloud.

The system itself is codenamed Zaius: a dual-socket POWER9 SO server with 32 DDR4 memory slots, two NVLink slots, three PCIe gen-4 x16 slots, and a total core count of 44. And what’s to like? For one thing: high-speed NVLink interconnects between CPUs and Nvidia GPU accelerators, which Google likes to throw its deep-learning AI code at.”

The Next Platform:

“Google, as one of the five founding members of the OpenPower Foundation in the summer of 2013, is always secretive about its server, storage, and switching platforms, absent the occasional glimpse that only whets the appetite for more disclosures. But at last year’s OpenPower Summit, Gordon McKean, senior director of server and storage systems design and the first chairman of the foundation, gave The Next Platform a glimpse into its thinking about Power-based systems, saying that the company was concerned about the difficulty of squeezing more performance out of systems, and his boss, Urs Hölzle, senior vice president of the technical infrastructure team, confirmed to us in a meeting at the Googleplex that Google would absolutely switch to a Power architecture for its systems, even for a single generation, if it could get a 20 percent price/performance advantage.

Maire Mahoney, engineering manager at Google and now a director of the OpenPower Foundation, confirmed to The Next Platform that Google does indeed have custom Power8 machines running in its datacenters and that developers can deploy key Google applications onto these platforms if they see fit. Mahoney was not at liberty to say how many Power-based machines are running in Google’s datacenters or what particular workloads were running in production (if any). What she did say is that Google “was all in” with its Power server development and echoed the comments of Hölzle that if the Power machines “give us the TCO then we will do it.”

The POWER8 chips got Google’s attention because of the massive memory and I/O bandwidth they have compared to Xeon processors, and it looks like Google and the other hyperscalers have been able to get IBM to forge the POWER9 chip in their image, with more cores and even more I/O and memory bandwidth. “The vision is to build scale out server systems taking advantage of the amazing I/O subsystem that the OpenPower architecture delivers,” Mahoney added.

We happen to think that Rackspace would have done something like Zaius on its own, but the fact that Google is helping with the design and presumably will deploy it in some reasonable volumes means that the ecosystem of manufacturing partners for the Zaius machines should be larger than for Barreleye. And with IBM shipping on the order of several tens of thousands of Power systems a year at this point, if Google and Rackspace dedicate even a small portion of their fleets to Power, it would be a big bump up in shipments.”

I received links to these articles in a group email to IBM Champions:

Bloomberg: 

“Google also said it’s developing a data center server with cloud-computing company Rackspace Hosting Inc. that runs on a new IBM OpenPower chip called POWER9, rather than Intel processors that go into most servers. The final design will be given away through Facebook Inc.’s Open Compute Project, so other companies can build their data center servers this way, too.”

Fortune: 

“The search giant [Google] said on Wednesday that, along with cloud computing company Rackspace, it’s co-developing new server designs that are based on IBM chip technology.”

IDG News Service: 

“Two years ago, Google showed a Power server board it had developed for testing purposes, though it hadn’t said much about those efforts since. It’s now clear that Google is serious about using the IBM chip in its infrastructure.”

San Antonio Business Journal: 

“The two tech giants are using an open source server created by IBM called the POWER9 processor. It is among more than 50 new products being developed across 200 technology companies as part of the OpenPOWER Foundation, an industry controlled nonprofit dealing with the reality and cost of big data demands.”

TechRepublic: 

“The benefit of the Power architecture goes beyond price for performance. Because of the architectural limitations of x86-64, Intel has faced substantive difficulty pushing the number of threads in a processor. Intel’s 22-core Xeon E5-2699 v4 is limited to 44 threads, whereas the 12-core POWER8 has 96 threads.”

ZDNet: 

“The explosion of data requires systems and infrastructures based on POWER8+ accelerators that can both stream and manage the data and quickly synthesize and make sense of data, IBM said about the UM [University of Michigan] partnership.”

The Next Platform: 

“IBM Unfolds Power Chip Roadmap Out Past 2020.”

As a POWER bigot, I love it when mainstream tech outlets acknowledge the benefits of the technology I know and love. And I’m excited to think that this publicity will lead to new customers potentially choosing Linux on Power over x86 solutions.

Migrating to POWER8 Systems

Edit: Hopefully now you are migrating to POWER9

Originally posted April 5, 2016 on AIXchange

You just found out you’re getting new hardware. But hold the celebration — how do you get your existing LPARs to run on it?

This document covers migration paths for AIX systems to POWER8 systems. It’s “intended as a quick-reference guide in transitioning an existing AIX system from prior POWER architectures to a POWER8 system. The focus is on AIX itself, not the application stack.”

The document includes a chart covering migration paths such as Live Partition Mobility, NIM alt disk migration, update/migration installs and versioned WPARs.

Here’s more:

“Which options are available to me?

For AIX 5.3 and earlier
You’ll need to migrate to a POWER8-supported level. … there are fundamentally 3 options in this case:

1. NIM alt disk migration
2. Migrate in-place, then either mksysb, alt_disk_copy, or Live Partition Mobility (if going from POWER6 or POWER7 system).
3. Create mksysb of 5.2 or 5.3 system, install supported 7.1 on POWER8 system, and create AIX 5.2 or 5.3 Versioned WPAR from the mksysb.

For AIX 6.1 or 7.1
You have the option of doing an AIX update to a supported level instead of a migration, though if on AIX 6.1 you may still choose to migrate to 7.1 to get full POWER8 capabilities. Again… there are fundamentally 3 options:

1. If at a level that supports POWER8 and if the system is LPM-capable, Live Partition Mobility can be used to move to the POWER8 system.
2. If at a level that supports POWER8, use mksysb or alt_disk_copy to move to the POWER8 system and AIX update on the POWER8 system only if desired.
3. Update in-place and either mksysb, alt_disk_copy, or Live Partition Mobility (if going from POWER6 or POWER7 system). Note that if alt_disk_copy is chosen, the update can be to the alternate disk rather than in-place.

Partition Mobility is an option for moving partitions dynamically from POWER6/POWER7 to POWER8 systems, provided that the partitions are LPM-capable. Partition Mobility can be performed on both HMC managed systems as well as on Integrated Virtualization manager (IVM) managed systems. The FLRT tool can be used to validate the source and target systems for LPM.

Two types of migration are available depending on the state of the logical partition:
– The migration is active if the mobile partition is in running state.
– The migration is inactive if the mobile partition is shutdown.

Considerations
– POWER6 or POWER7 system is required.
– LPARs must be LPM capable.”

This document can help you decide which method will work best in your environment. It’s worth your time.
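If you go the NIM alt disk migration route (option 1 above), the nimadm invocation looks something like this. It’s a sketch only; the resource names, cache volume group and target disk are placeholders for your environment:

    # On the NIM master: clone the client's rootvg to hdisk1, migrate the
    # clone to the AIX 7.1 level in the named lpp_source, and let the client
    # boot from the migrated disk at the next reboot.
    nimadm -j nimadmvg -c aix53_client -l aix71_lpp -s aix71_spot -d hdisk1 -Y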

Also, don’t forget that there are now options for running AIX 5.3 on POWER8 systems without using a WPAR. If you find yourself in this situation, you can still move up to POWER8.

Another Lifeline for Those on AIX 5.3 Extended Support

Edit: There are still plenty of people on older hardware and software.

Originally posted March 29, 2016 on AIXchange

Nobody likes to admit it, but many customers are still running AIX 5.3 on older hardware. There are many reasons for this. Maybe you have a few LPARs running an older OS. Maybe you’re reliant on a critical application that’s no longer supported. Maybe you’ve fallen so far behind on patching and upgrading that running an old OS is an acceptable risk. Or perhaps it’s simply the stubbornness of an “if it ain’t broke, don’t fix it” mentality.

Whatever your reason, as long as IBM continued to offer extended support for AIX 5.3, you had some peace of mind. Hopefully though, you understood that this wouldn’t last. And now we know: Earlier in March, IBM announced that AIX 5.3 extended support will be discontinued on April 30. However, IBM is throwing customers another lifeline by offering the capability to run AIX 5.3 natively on POWER8 servers:

“Many clients are still using an IBM AIX 5.3 application environment on their IBM Power Systems servers. AIX 5.3 reached end of life in April 2015. However, an extended service contract was offered for 12 months on all supported hardware. This contract will end on April 30, 2016.

Many clients have a subset of applications that are still dependent on a supported AIX 5.3 environment. IBM is enabling AIX 5.3 to run natively on POWER8 servers. The PTF U866665.bff will enable the AIX 5.3 image to run on POWER8 servers and will be available on March 11, 2016.

The LPAR must be at AIX 5.3 TL12 SP9 (latest 5.3 release).

The POWER8 LPAR will run in POWER6 compatibility mode and is limited to SMT2 mode. SMT2 mode results in some capacity loss compared to SMT4/SMT8 mode. IBM publishes SMT2 rPerf values that can be used to quantify POWER8 SMT2 capacity.

AIX 5.3 POWER8 LPARs only support Virtual I/O configurations: vSCSI, NPIV, and VLAN.

Only the following installation methods are supported for 5.3 on POWER8 technology-based systems:

mksysb: First, perform an in-place update to a supported (POWER5/POWER6/POWER7) 5.3 TL12 SP9 LPAR with PTF U866665. The standard mksysb command can then be used to capture a POWER8-capable mksysb image. The mksysb image can then be used to install POWER8 LPARs.

NIM: A 5.3 TL12 SP9 NIM environment must be updated to support POWER8. A 5.3 TL12 SP9 NIM lppsource must be updated to include PTF U866665. A NIM SPOT must then be created or updated to utilize the updated lppsource.

All POWER8 systems are supported with the following restrictions:
* POWER8 systems must be at the 840 firmware level.
* POWER8 LPARs must be served by a 2.2.4.10 or 2.2.3.60 VIOS.

Service and support contract: The AIX 5.3 environment on POWER is planned to be supported for a total of 15 months through June 30, 2017. Clients will have to first acquire a service and support contract, after which they will be entitled to download the PTF.”
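As a rough sketch of what the NIM side of this might look like (the resource names and PTF staging directory are placeholders), you would add the PTF to the 5.3 TL12 SP9 lpp_source and then refresh the SPOT from it:

    # Add the POWER8-enablement PTF to the existing lpp_source
    nim -o update -a packages=all -a source=/export/ptf 53_tl12sp9_lpp

    # Update the SPOT from the refreshed lpp_source
    nim -o cust -a lpp_source=53_tl12sp9_lpp -a fixes=update_all 53_tl12sp9_spot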

Of course others outside of IBM have written on this prior to the official announcement (here). It also came up in an IBM training class I’d attended, where I was told not only about the extension, but its benefits:

“This is a new offering and very good news for customers that have AIX 5.3 applications needing a supported environment for the next 12-15 months. Customers can usually drop down a tier from their existing servers when they move to POWER8, requiring fewer cores due to higher performance, and save on per core service and support costs. The resulting savings are often significant enough to justify investment in new hardware. Moreover, with fewer cores, customers can save significantly on software license costs as well.”

Chris Gibson has a nice write-up, along with a first look at running a 5.3 LPAR on a POWER8 system.

So if you’re one of those holdouts, let’s hear from you: Will this new capability motivate you to migrate your 5.3 LPARs to POWER8? If not, why not?

Finding Minimum AIX Hardware Support Levels

Edit: I still refer to this all of the time.

Originally posted March 22, 2016 on AIXchange

If you just bought an 8408-E8E — otherwise known as the E850 — you may be wondering about its minimum supported AIX versions. Turns out there’s an easy way to find this information: Just go to the System to AIX Maps web page.

Looking down the list, under the POWER8 heading, you’ll find the E850. There are two choices: one for physical adapters, and another for virtualized adapters and the VIO server.

Under the All I/O Configurations link for the E850, there are two options, 7.1 and 6.1:

    Technology Level    Base Level    Recommended Level    Latest Level
    7100-02             7100-02-06    7100-02-07           7100-02-07
    6100-08             6100-08-06    6100-08-07           6100-08-07

Under Virtual I/O Only, there are several options:

    Technology Level    Base Level    Recommended Level    Latest Level
    7200-00             7200-00-01    7200-00-01           7200-00-01
    7100-04             7100-04-00    7100-04-00           7100-04-01
    7100-03             7100-03-01    7100-03-05           7100-03-06
    7100-02             7100-02-01    7100-02-07           7100-02-07
    6100-09             6100-09-01    6100-09-05           6100-09-06
    6100-08             6100-08-01    6100-08-07           6100-08-07

Obviously you can look up many systems besides the E850. Available hardware types go all the way back to POWER4 and beyond, including even older models that only run older versions of AIX. In some cases that old information is no longer online, but for most of the hardware you’d actually run, you can at least find the minimum supported levels, depending on how you set up your I/O.
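To compare a given LPAR against these tables, first check its exact level with oslevel; the output shown in the comment is just an example:

    # Show the highest technology level and service pack installed:
    oslevel -s
    # 7100-02-07-1316    (AIX 7.1, TL 02, SP 07: the 7100-02 row above)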

System to AIX Maps is well worth a bookmark. Be sure to check here to verify that what you’re planning to do will actually work.

POWER Systems? There’s an app for that.

Edit: Do you run the app?

Originally posted March 15, 2016 on AIXchange

It seems like there’s an app for everything related to your Power hardware these days. There’s myHMC Mobile (which I covered here), along with the IBM Redbooks Mobile App. And now there’s the IBM Technical Support Mobile App (available for Android and iPhone):

“The IBM Technical Support mobile app lets clients worldwide quickly and easily access key technical support content and functions for all IBM software and hardware products.

You can use the app to:

  • Expedite troubleshooting by searching for, viewing, and bookmarking technical support content like technotes, APARs, documentation, and Redbooks.
  • View and update your software and hardware Service Request tickets whenever and wherever you need to.
  • Discover the best fixes for your system and email the fix orders using the Fix Level Recommendation Tool.
  • Look up warranty information for hardware systems by scanning the bar code or entering the Machine Type/Serial Number.
  • View Customer Support Plans for your products.
  • Contact IBM, with geo-location assistance and click-to-call.
  • Provide feedback about the app through its Feedback form.”

Installing the app on my Android phone was simple enough. I searched Google Play for IBM Technical Support and it came right up. It did need quite a few permissions. I never know why apps need access to my camera or my photos and files, but I accepted everything so I could test it out.

There are quite a few options on the main page, things like support content where you can search for whatever you need. For a test I just entered S822 and quite a few useful items came back. Being able to do these quick, simple lookups could certainly be handy whenever I’m on a raised floor.

There’s a menu option for service requests. I signed in with my IBM ID and was able to view my open software requests, hardware requests, etc. By selecting Full Site, I was able to open a new PMR from my phone.

Features include support videos, questions and answers (which brings you to IBM developerWorks forums), customer support plans and warranty lookup (where you can scan your server’s bar code). When I entered my machine type and serial number, I received info about my warranty status and system expiration date, along with parts that were shipped with the system.

There’s a menu item that takes you to FLRT LITE. Another option lets you change your settings and language. There are also options to provide feedback and contact IBM.

I found the back button would take me out of the app more often than I liked, but hopefully over time this will be addressed. Overall, I expect I’ll legitimately use this a lot — and I’m not a person who downloads a ton of apps.

If you’ve downloaded and tried the IBM Technical Support Mobile App, let me know what you think.

The Fix (Level) is In: Using FLRT and FLRT LITE

Edit: I still use them both.

Originally posted March 8, 2016 on AIXchange

I’ve mentioned FLRT previously (here, here and here). Hopefully you’ve taken advantage of the tool. On countless occasions it’s helped me determine the latest versions of OSs, firmware and applications, along with end-of-life dates and more.

From IBM:

“The Fix Level Recommendation Tool (FLRT) provides cross-product compatibility information and fix recommendations for IBM products. Use FLRT to plan upgrades of key components or to verify the current health of a system. Enter your current levels of firmware and software to receive a recommendation. When planning upgrades, enter the levels of firmware or software you want to use, so you can verify levels and compatibility across products before you upgrade.”

If you’re new to FLRT, here’s how to get started. First, go to the IBM link above and select your server machine type and model. Then skip down to Partition OS and select AIX, and then select the version you’re running. At this point you can click submit and confirm your AIX level. You’ll also be provided with recommendations for updates and/or upgrades.

That’s just the beginning. FLRT allows you to really drill down and find out the recommendations for your entire stack, including machine firmware, HMC code levels, operating system levels, cluster/virtualization and POWER software, and even disk subsystems.

As useful as FLRT is, for the uninitiated, the tool comes with a learning curve. Getting a response to even a simple query — something like, what is the latest version of AIX? — can be a painful exercise. Fortunately, if you’re just looking for a quick answer to a single question, there’s FLRT LITE.

FLRT LITE has a simple interface: just choose one of the following products, click on it, and the information you’re interested in is at your fingertips:

    Power, PureFlex and Power Blade System Firmware
    HMC and HMC Virtual Appliance
    AIX
    PowerVM Virtual I/O Server
    PowerHA SystemMirror
    Cluster Systems Management
    General Parallel File System
    General Parallel File System Standard Edition
    General Parallel File System Express Edition
    General Parallel File System Advanced Edition
    LoadLeveler
    Parallel Engineering and Scientific Subroutine Library
    Parallel Environment
    Parallel Environment Developer Edition for AIX
    Parallel Environment Runtime Edition for AIX
    PowerVP Standard Edition
    PowerKVM
    Red Hat Enterprise Linux
    SUSE Linux Enterprise Server
    PowerVC Standard Edition
    Spectrum Scale

To learn more about FLRT LITE, read these articles (here and here).

As I said, I use FLRT a lot. How about you? How often do you need to look up OS versions and related information? Do you know of an easier way to get it?

How Does Your Database Rate?

Edit: Do you ever check these?

Originally posted March 1, 2016 on AIXchange

The website db-engines.com rates “database management systems according to their popularity.”

The list has been around for a few years, and as this InfoWorld article notes, “It isn’t forensically precise, nor is it meant to be; it’s intended to give a sense of trends over time.”

The left nav bar contains database rankings by type, including relational databases (IBM’s DB2 is fifth on that list), key-value stores and document stores. You can also see how prevalent open source databases have become.

Here’s how the rankings are calculated:

“The DB-Engines Ranking is a list of database management systems ranked by their current popularity. We measure the popularity of a system by using the following parameters:

* Number of mentions of the system on websites, measured as number of results in search engines queries. At the moment, we use Google and Bing for this measurement. In order to count only relevant results, we are searching for <system name> together with the term database, e.g. “Oracle” and “database”.

* General interest in the system. For this measurement, we use the frequency of searches in Google Trends.

* Frequency of technical discussions about the system. We use the number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.

* Number of job offers, in which the system is mentioned. We use the number of offers on the leading job search engines Indeed and Simply Hired.

* Number of profiles in professional networks, in which the system is mentioned. We use the internationally most popular professional network LinkedIn.

* Relevance in social networks. We count the number of Twitter tweets, in which the system is mentioned.

We calculate the popularity value of a system by standardizing and averaging of the individual parameters. These mathematical transformations are made in a way so that the distance of the individual systems is preserved. That means, when system A has twice as large a value in the DB-Engines Ranking as system B, then it is twice as popular when averaged over the individual evaluation criteria.

The DB-Engines Ranking does not measure the number of installations of the systems, or their use within IT systems. It can be expected, that an increase of the popularity of a system as measured by the DB-Engines Ranking (e.g. in discussions or job offers) precedes a corresponding broad use of the system by a certain time factor. Because of this, the DB-Engines Ranking can act as an early indicator.”

A blog posting went into further detail about the rankings:

“1) The Ranking uses the raw values from several data sources as input. E.g. we count the number of Google and Bing results, the number of open jobs, number of questions on StackOverflow, number of profiles in LinkedIn, number of Twitter tweets and many more.

2) We normalize those raw values for each data source. That is done by dividing them with the average of a selection of the leading systems in each source. That is necessary to eliminate the bias of changing popularity of the sources itself. For example, LinkedIn increases the number of its members every month, and therefore the raw values for most systems increase over time. This increase, however, is rather due to the growing adoption of LinkedIn and not necessarily resulting from an increased popularity of a specific system in LinkedIn. Giving another example: an outage of twitter would reduce the raw values for most of the systems in that month, but obviously has nothing to do with their popularity. For that reason, we are using a selection of the best systems in each data source as a ‘benchmark’.

3) The normalized values are then delinearized, summed up over all data sources (with weighting the sources), re-linearized and scaled. The result is the final score of the system.

The normalization step is the key to understanding the December results: the top three systems in the ranking (Oracle, MySQL and SQL Server) all increased their score. Oracle and MySQL gained formidable 16 and 11 points respectively. As a consequence the benchmark increased, leading to potentially less points for many other systems.

Why are we not using all systems as a benchmark for a data source? Well, we continuously add new systems to our ranking. Those systems typically have a low score (assuming that we are not missing major players). Then, each newly added system would reduce the benchmark and increase the score of most of the other systems.

Conclusion: it is important to understand the score as a relative value which has to be compared to other systems. Only that can guarantee a fair and unbiased score by eliminating influences of the usage of the data sources itself.”
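
To make that concrete, here’s a toy sketch of just the normalization step (the numbers are entirely made up, and it skips the delinearize/re-linearize pass). It shows how dividing by a benchmark of leading systems keeps scores comparable across data sources:

    #!/usr/bin/ksh
    # Toy illustration of the normalization step: each raw value is divided
    # by the average of the leading systems in that data source, then the
    # normalized values are averaged across sources. Numbers are made up.
    echo "oracle 1430 980
    mysql 1270 1050
    postgres 300 410" | awk '
    { raw1[$1] = $2; raw2[$1] = $3 }
    END {
        # use the two leading systems as the benchmark for each source
        bench1 = (raw1["oracle"] + raw1["mysql"]) / 2
        bench2 = (raw2["oracle"] + raw2["mysql"]) / 2
        for (s in raw1)
            printf "%-10s score: %.2f\n", s, (raw1[s]/bench1 + raw2[s]/bench2) / 2
    }'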

While my work almost exclusively involves supporting systems with databases that run on AIX, I still find it worthwhile to learn more about other systems and databases. It’s good to know what else customers are working with.

Running LPM on Selected Partitions

Edit: This is still a valuable tidbit.

Originally posted February 23, 2016 on AIXchange

A while back a customer got word from Oracle that they would be charged for every core on a system that could be used for Live Partition Mobility, even cores that weren’t used by their Oracle database.

“IBM Power VM Live Partition Mobility is not an approved hard partitioning technology. All cores on both the source and destination servers in an environment using IBM Power VM Live Partition Mobility must be licensed.”

The customer found LPM very useful for performing maintenance on their hardware and rebalancing their workloads. They didn’t want to give it up, but naturally, they didn’t want to have to license every core on their machines, either.

To address this problem, they were looking for a way to disable LPM on their Oracle LPARs while still allowing their other LPARs to use LPM. Since LPM is enabled at the frame level with PowerVM Enterprise Edition, they were unsure how this could be done. Could they change the SAN zoning for these LPARs so they would be unable to run LPM? Or should they just bite the bullet and buy some smaller servers and completely segregate their Oracle workload onto frames with no LPM available? (They’re also considering migrating off of Oracle altogether.)

This post caught their eye. It describes how LPM can now be disabled on a per partition basis:

“HMC V8R8.4.0 introduces a new partition-level attribute to disable Live Partition Migration (LPM) for that partition. The HMC will block LPM for that partition as long as this attribute is enabled. This feature can be used by ISVs to deal with application licensing issues.

Some applications require the user to purchase a license for all systems that could theoretically host the running LPAR. That is, if an LPAR can theoretically be migrated (whether you do so or not) to a total of 4 managed systems, you may be required to purchase a software license for all 4 systems. If you don’t plan on ever migrating the LPAR hosting the application, then this attribute provides an audit-able mechanism to prevent the LPAR from ever being migrated. It should be noted that no IBM software has such licensing requirements.

One benefit of this attribute implementation is it is not dependent on the managed server firmware version so you can use this feature from the HMC enhanced+ GUI, REST API, or CLI on any system the HMC is managing.

One thing to note is that while NovaLink will honor this attribute in a co-managed environment, it does not provide any way to alter the value.

Any change to this attribute is logged as a system event, and can be checked for auditing purposes. A system event will also be logged when the Remote Restart or Simplified Remote Restart capability is set. More specifically, a system event is logged when:

* any of these three attributes are set during the partition creation
* any of these three attributes are modified
* restoring profile data

Users can check system events using the lssvcevents CLI and/or the “View Management Console Events” GUI. Using HMC’s rsyslog support, these system events can also be sent to a remote server on the same network as the HMC.

1) Command to check which partitions managed by this HMC have LPM disabled or enabled

    lssvcevents -t console | grep vclient

time=10/30/2015 10:11:32,text=HSCE2521 UserName hscroot: Enabled partition migration for partition vclient10 with Id 10 on Managed system ct05 with MTMS 8205-E6D*1234567.

time=10/30/2015 10:01:35,text=HSCE2520 UserName hscroot: Disabled partition migration for partition vclient9 with Id 9 on Managed system ct05 with MTMS 8205-E6D*1234567.

2) Command to check which partitions managed by this HMC have LPM disabled

    lssvcevents -t console | grep HSCE2520

time=10/30/2015 10:01:35,text=HSCE2520 UserName hscroot: Disabled partition migration for partition vclient9 with Id 9 on Managed system ct05 with MTMS 8205-E6D*1234567.

3) Command to check which partitions managed by this HMC have LPM disabled or enabled for a particular managed server (1234567)

    lssvcevents -t console | grep "partition migration for partition" | grep 1234567

time=10/30/2015 10:11:32,text=HSCE2521 UserName hscroot: Enabled partition migration for partition vclient10 with Id 10 on Managed system ct05 with MTMS 8205-E6D*1234567.

time=10/30/2015 10:01:35,text=HSCE2520 UserName hscroot: Disabled partition migration for partition vclient9 with Id 9 on Managed system ct05 with MTMS 8205-E6D*1234567.

4) Command to check if a specific partition (vclient9) in a specific managed server (1234567) managed by this HMC has LPM disabled or enabled

    lssvcevents -t console | grep "partition migration for partition vclient9" | grep 1234567

time=10/30/2015 10:01:35,text=HSCE2520 UserName hscroot: Disabled partition migration for partition vclient9 with Id 9 on Managed system ct05 with MTMS 8205-E6D*1234567.”
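
The post shows how to audit the setting but not how to change it. If memory serves, the attribute surfaces in the partition configuration as migration_disabled, so something like the following should work from the HMC command line; treat the attribute name as an assumption and check your HMC’s lssyscfg output first (the system and partition names here are placeholders):

    # Disable LPM for a single partition (attribute name assumed)
    chsyscfg -r lpar -m managed_system -i "name=oracle_lpar,migration_disabled=1"

    # Confirm the setting for all partitions on that managed system
    lssyscfg -r lpar -m managed_system -F name,migration_disabled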

Have you ever wanted or needed to disable LPM for specific LPARs, either due to an Oracle mandate or some other reason? Let me know in the comments.

Proud to be a (Returning) Champion

Edit: I am still proud to be a Champion.

Originally posted February 16, 2016 on AIXchange

Last fall I wrote about the relaunch of the IBM Champions program. Here’s how I described it back in 2011:

“Apple fanboy” is a moniker that’s sometimes given to those who love Apple products. Along those lines, I guess I’m a “Power fanboy.” I love the platform and the operating systems that run on it. I love the virtualization capabilities, the performance and the reliability. And, as readers of this blog surely know by now, I love telling others about Power Systems servers. I’ve been reading the articles and following the tweets of other Power Champions for some time, which makes me all the more proud to be included in this group and recognized for my efforts.

I was very proud to be chosen as an IBM Champion nearly five years ago, and I’m just as proud to be one of the 17 returning champions — among 34 selections overall — in 2016:

After much reviewing of applications and evaluating contributions, we’re happy to announce the 2016 IBM Champions for IBM Power! We have a total of 34 Champions, with 17 new Champions, and 17 returning Champions.

Congratulations to our 2016 IBM Champions for Power!

These individuals are non-IBMers who evangelize IBM solutions, share their knowledge, and help grow the community of professionals who are focused on IBM Power systems. IBM Champions spend a considerable amount of their own time, energy and resources on community efforts — organizing and leading user group events, answering questions in forums, contributing wiki articles and applications, publishing podcasts, sharing instructional videos, and more!

IBM Champions are also granted access to key IBM business executives and technical leaders to share their opinions, learn about strategic plans, and ask questions. In addition, they may be offered various speaking opportunities that enable them to raise their visibility and broaden their sphere of influence.

Look for an in-depth article on the IBM Champion program and profiles of some of the new IBM Champions for Power in the April issue of IBM Systems magazine.

It’s an honor for me, but I’m also happy for all the deserving recipients. I’ve learned so much from the other 33 people on this list, and I look forward to learning more from them in the future:

Congratulations to:

  •     Torbjorn Appehl
  •     Balazs Babinecz
  •     Aaron Bartell
  •     Alberto C. Blanch
  •     Shawn Bodily
  •     Benoit Creau
  •     Shrirang “Ranga” Deshpande
  •     Waldemar Duszyk
  •     Anthony English
  •     Pat Fleming
  •     Nigel Fortlage
  •     Susan Gantner
  •     David Gibbs
  •     Ron Gordon
  •     Midori Hosomi
  •     Tom Huntington
  •     Terry Keene
  •     Andy Lin
  •     Alan Marblestone
  •     Christian Masse
  •     Pete Massiello
  •     Rob McNelly
  •     Brett Murphy
  •     Jon Paris
  •     Mike Pavlak
  •     Trevor Perry
  •     Steve Pitcher
  •     Billy Schonauer
  •     Brian Smith
  •     David Tansley
  •     Paul Tuohy
  •     Jeroen Van Lommel
  •     Dave Waddell
  •     Charles Wright

Single-User Mode vs. Maintenance Mode

Edit: Avoid a resume generating event.

Originally posted February 9, 2016 on AIXchange

Recently I was telling a customer about the differences between booting into single-user mode and booting into maintenance mode. If you’re not familiar with these procedures, I recommend either using an existing LPAR or creating a new LPAR and trying them both. But before you do that, check out two valuable IBM support technotes (FAQs) that walk through each method.

This document tells you how to boot AIX to single-user mode to perform maintenance. (Note: You’ll need to know the root password to do this):

In AIX we don’t tend to use single-user mode very much, because many problems require having the rootvg filesystems unmounted for repairs. However, there are some instances when it’s beneficial to use single-user:

  • The system boot hangs due to TCP/IP or NFS configuration issues
  • [To] do work on non-root volume groups
  • To debug problems with entries in /etc/inittab
  • To work on the system without users attempting to log in
  • To work without applications starting up
  • It is easy to unmount /tmp and /var if they need to be checked with fsck or recreated

If the system boots fine from the rootvg, then booting into single-user to repair or perform work has advantages:

  • It boots quicker than Maintenance Mode.
  • You can boot off the normal system rootvg without finding AIX Install media or setting up a NIM SPOT.
  • It allows you to run all commands you would normally have access to in multiuser.
  • Unlike maintenance mode, there is no possibility that hdisks will be renamed.

Procedure
Standalone System (no HMC):
1. Boot system with no media in the CD/DVD drive
2. Wait until you see the options of choosing another boot list, and hear beeps on the console
3. Press 6 to start diagnostics.

System using an HMC:
1. Select the LPAR in the HMC GUI
2. Select Operations -> Activate
3. In the Activate window, click the button that says “Advanced”
4. Change “Boot mode” to “Diagnostic with stored boot list”
5. Click “OK” to save that change, then “OK” again to activate.

More menu options follow, so be sure to read the whole thing.
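
One related tidbit the docs don’t dwell on: if the system is already up in multiuser mode, you don’t need a diagnostic boot to get to single-user. From a root shell:

    # Bring a running AIX system down to single-user (maintenance) mode;
    # users are logged off and most daemons are stopped
    shutdown -m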

This doc tells you how to boot into maintenance mode on AIX systems to perform maintenance on the rootvg volume group or restore files from an mksysb backup.

There is a variety of media that can be used to boot an AIX system into Maintenance Mode. These consist of:
    1. A non-autoinstall mksysb taken from the same system or another system running the same level of AIX, either on tape or CD/DVD media.
    2. AIX bootable installation media (CD or DVD).
    3. A NIM server with a SPOT configured, and set up to boot this machine for maintenance work.

For certain work it is important to have the exact same level (AIX version, Technology Level, and Service Pack) on the boot media as is installed on disk. In these cases if the system is booted with different levels, the rootvg filesystems and commands may not be available to use.

This portion of the doc is found under the heading, Maintenance Mode Options:

At this point a decision must be made.

Option 1 will attempt to mount the rootvg filesystems and load the ODM from /etc/objrepos. If this works you will have full access to the rootvg filesystems and ODM, so you may run commands such as bosboot, rmlvcopy, syncvg, etc. If the version of AIX you have booted from (either from media or NIM SPOT) is not exactly the same as on disk, this will error and fail to mount the filesystems.

Option 2 will import the rootvg and start an interactive shell before mounting any filesystems. This interactive shell has very few commands available to it. As it has not mounted any filesystems from the rootvg it does not have access to rootvg files or the ODM. Use this option when performing maintenance on the rootvg filesystems themselves, such as fsck, rmlv, or logform.

    1) Access this Volume Group and start a shell
    2) Access this Volume Group and start a shell before mounting filesystems

This portion of the doc is found under the heading, Notes on Maintenance Mode:

1. The terminal type is not usually set up correctly for using the vi editor (in Option 2 only). To set it, type:
    # export TERM=xterm

2. If you mount any rootvg filesystems (either automatically under Option 1 or by hand under Option 2) and change any files you must manually sync the data from filesystem buffer cache to disk. Normally the syncd daemon does this for you every 30 seconds, but no daemons are running in maintenance mode. To sync the data type:
    # sync; sync; sync

3. Typically there is no network connectivity in maintenance mode, so FTP or telnet are not available.

4. If you are in Option 2 with no filesystems mounted and wish to mount the filesystems and load the ODM you can type:
    # exit

Leaving Maintenance Mode
If you have chosen Option 2 and you have not mounted any filesystems by hand, just shut down the LPAR (via the HMC) or power off a standalone server. If you are ready to boot AIX to multiuser then activate the LPAR or if a standalone server power it on via the front panel.

If you have chosen Option 1 type these commands to reboot the system:
    # sync; sync; sync; reboot

Again, read both documents in their entirety to learn more. And because preparation is always worthwhile, I’ll add that it’s always a good time to verify that you have good mksysb images that you could use if needed. 
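
On that note, here’s one quick sanity check that lists what’s inside a mksysb image file (assuming the bos.sysmgt.sysbr fileset is installed; the image path is a placeholder):

    # Display volume group information and contents of a mksysb image
    # (substitute your own backup location)
    lsmksysb -lf /export/mksysb/lpar01.mksysb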

Speaking of the importance of preparation, not long ago I heard from someone whose server failed to boot up after a recent power outage. They were booting from local disks and didn’t have their rootvg mirrored. They did not have backups. They did have a very bad day. Some folks refer to this type of situation as an RGE, or a resume-generating event. With minimal effort now, you can avoid the same fate.

New Solutions to the Age-Old Problem of Memory Errors

Edit: Yet another reason to look at Enterprise class hardware.

Originally posted February 2, 2016 on AIXchange

This article made the rounds on Twitter awhile ago. It’s worth your time if you haven’t read it:

Not long after the first personal computers started entering people’s homes, Intel fell victim to a nasty kind of memory error. The company, which had commercialized the very first dynamic random-access memory (DRAM) chip in 1971 with a 1,024-bit device, was continuing to increase data densities. A few years later, Intel’s then cutting-edge 16-kilobit DRAM chips were sometimes storing bits differently from the way they were written. Indeed, they were making these mistakes at an alarmingly high rate. The cause was ultimately traced to the ceramic packaging for these DRAM devices. Trace amounts of radioactive material that had gotten into the chip packaging were emitting alpha particles and corrupting the data.

Once uncovered, this problem was easy enough to fix. But DRAM errors haven’t disappeared. As a computer user, you’re probably familiar with what can result: the infamous blue screen of death. In the middle of an important project, your machine crashes or applications grind to a halt. While there can be many reasons for such annoying glitches—including program bugs, clashing software packages, and malware—DRAM errors can also be the culprit.

For personal-computer users, such episodes are mostly just an annoyance. But for large-scale commercial operators, reliability issues are becoming the limiting factor in the creation and design of their systems.

Most consumer-grade computers offer no protection against such problems, but servers typically use what is called an error-correcting code (ECC) in their DRAM. The basic strategy is that by storing more bits than are needed to hold the data, the chip can detect and possibly even correct memory errors, as long as not too many bits are flipped simultaneously. But errors that are too severe can still cause machines to crash.

There was some unquestionably good news. For one, high temperatures don’t degrade memory as much as people had thought. This is valuable to know: By letting machines run somewhat hotter than usual, big data centers can save on cooling costs and also cut down on associated carbon emissions.

One of the most important things we discovered was that a small minority of the machines caused a large majority of the errors. That is, the errors tended to hit the same memory modules time and again.

The bad news is that hard errors are permanent. The good news is that they are easy to work around. If errors take place repeatedly in the same memory address, you can just blacklist that address. And you can do that well before the computer crashes.

When you consider all the effort that goes into making today’s servers even more reliable, I think it’s even more impressive to consider how IBM has designed Power Systems. From the E870/E880 Redbook:

2.3.6 Memory Error Correction and Recovery
The memory error detection and correction circuitry is designed such that the failure of any one specific memory module within an ECC word can be corrected without any other fault.
In addition, a spare DRAM per rank on each memory port provides for dynamic DRAM device replacement during runtime operation. Also, dynamic lane sparing on the DMI link allows for repair of a faulty data lane.

Other memory protection features include retry capabilities for certain faults detected at both the memory controller and the memory buffer.

Memory is also periodically scrubbed to allow for soft errors to be corrected and for solid single-cell errors reported to the hypervisor, which supports operating system deallocation of a page associated with a hard single-cell fault.

2.3.7 Special Uncorrectable Error handling
Special Uncorrectable Error (SUE) handling prevents an uncorrectable error in memory or cache from immediately causing the system to terminate. Rather, the system tags the data and determines whether it will ever be used again. If the error is irrelevant, it does not force a checkstop. If the data is used, termination can be limited to the program/kernel or hypervisor owning the data, or freeze of the I/O adapters controlled by an I/O hub controller if data is to be transferred to an I/O device.

4.3.10 Memory protection
The memory buffer chip is made by the same 22 nm technology that is used to make the POWER8 processor chip, and the memory buffer chip incorporates the same features in the technology to avoid soft errors. It implements a try again for many internally detected faults. This function complements a replay buffer in the memory controller in the processor, which also handles internally detected soft errors.

The bus between a processor memory controller and a DIMM uses CRC error detection that is coupled with the ability to try soft errors again. The bus features dynamic recalibration capabilities plus a spare data lane that can be substituted for a failing bus lane through the recalibration process. The buffer module implements an integrated L4 cache using eDRAM technology (with soft error hardening) and persistent error handling features.

For each such port, there are eight DRAM modules worth of data (64 bits) plus another DRAM module’s worth of error correction and other such data. There is also a spare DRAM module for each port that can be substituted for a failing port.

Two ports are combined into an ECC word and supply 128 bits of data. The ECC that is deployed can correct the result of an entire DRAM module that is faulty. This is also known as Chipkill correction. Then, it can correct at least an additional bit within the ECC word.

The additional spare DRAM modules are used so that when a DIMM experiences a Chipkill event within the DRAM modules under a port, the spare DRAM module can be substituted for a failing module, avoiding the need to replace the DIMM for a single Chipkill event.

Depending on how DRAM modules fail, it might be possible to tolerate up to four DRAM modules failing on a single DIMM without needing to replace the DIMM, and then still correct an additional DRAM module that is failing within the DIMM.

In addition to the protection that is provided by the ECC and sparing capabilities, the memory subsystem also implements scrubbing of memory to identify and correct single bit soft-errors. Hypervisors are informed of incidents of single-cell persistent (hard) faults for deallocation of associated pages. However, because of the ECC and sparing capabilities that are used, such memory page deallocation is not relied upon for repair of faulty hardware.

Finally, should an uncorrectable error in data be encountered, the memory that is impacted is marked with a special uncorrectable error code and handled as described for cache uncorrectable errors.

The Reliability, Availability, and Serviceability characteristics that are built into Power hardware (not just the memory subsystem) are just one of the many reasons I enjoy working on these systems.

Thoughts on SAP HANA’s Availability on Power Systems

Edit: Still the best place to run it.

Originally posted January 26, 2016 on AIXchange

I assume you’ve heard by now that SAP HANA is available on IBM Power Systems.

With this release, SAP HANA on IBM Power Systems is supported for customers running SAP Business Warehouse on IBM Power Systems. This solution is available on SUSE Linux, for configurations initially scaling-up to 3TB. This is available within the Tailored Datacenter Integration (TDI) model, which will enable customers to leverage their existing investments in infrastructure.

A large pharmaceutical company had a 100X improvement in query performance, and an 88% reduction in ETL execution time compared to what they had running the same workload on their legacy database. In another instance, a large energy provider saw a 95% reduction in query response times compared to running those same queries against a legacy database.

There was also this interesting post from Alfred Freudenberger, North America Power Systems sales executive, IBM Power Systems for SAP Environments. Some highlights:

In November, 2015, SAP unleashed a large assortment of support for HoP. First, they released a first of a kind support for running more than 1 production instance using virtualization on a system. For those that don’t recall, SAP limits systems running HANA in production on VMware to one, count that as 1, total VMs on the entire system.

SAP took the next step and increased the memory per core ratio on high end systems; i.e. the E870 and E880, to 50GB/core for BW workloads thereby increasing the total memory supported in a scale-up configuration to 4.8TB.

What does this mean for SAP customers? It means that the long wait is over. Finally, a robust, reliable, scalable and flexible platform is available to support a wide variety of HANA environments, especially those considered to be mission critical. Those customers that were waiting for a bet-your-business solution need wait no more.

Here’s another perspective:

In that blog, he did an excellent job of explaining how technical enhancements at a processor and memory subsystem level can result in dramatic improvement in the way that HANA operates. Now, I know what you are thinking; he likes what Dr. Plattner has to say about a competitor’s technology? Strange as it may seem, yes … in that he has pointed out a number of relevant features that, as good as Haswell-EX might be, POWER8 surpassed, even before Haswell-EX was announced.

All of these technical features and discussion are quite interesting to us propeller heads. Most business people, on the other hand, would probably prefer to discuss how to improve HANA operational characteristics, deliver flexibility to respond to changing business demands and meet end user SLAs including response time and continuous availability. This is where POWER8 really shines. With PowerVM at its core, Power Systems can be tailored to deliver capacity for HANA production to ensure consistent response time and peak load capacity during high demand times and allow other applications and partitions to utilize capacity unused by the HANA production partition. It can easily mix production with other production and non-production partitions. It features the ability to utilize shared network and SAN resources, if desired, to reduce data center cost and complexity. POWER8 delivers unmatched reliability by default, not as an option or a tradeoff against performance.

By comparison, SAP has only one certified benchmark for which HANA systems have been utilized called BW-EML. Haswell-EX cpus were used in the 2B row Dell PowerEdge 930 benchmark and delivered an impressive 172,450 Ad-hoc Navigation Steps/Hr. This is impressive in that it surpassed the previous IvyBridge based benchmark of 137,010 Ad-hoc Navigation Steps/Hr on the Dell PowerEdge R920, an increase of almost 26% which would normally be impressive if it weren’t for the fact that the system includes 20% more cores and 50% more memory. By comparison, POWER8 delivered 192,750 Ad-hoc Navigation Steps/Hr with the IBM Power Enterprise System 870 or 12% more performance with 45% fewer cores and 33% less memory resulting in twice the performance per core.

Finally, check this out:

Take for example, the SAP BW Enhanced Mixed Load (BW-EML) Standard Application Benchmark on four-socket servers. This benchmark has documented that POWER8 cores out-perform Haswell EX cores by two times while running SAP HANA analytics workloads.

That’s not even the best part. I have been impressed with the capability of the POWER8 line to scale to much higher core counts. The scaling ability of POWER8-based servers is key to both enabling workload consolidation and removing the need to break large datasets across multiple nodes which would otherwise negatively impact the latency of queries.

Of course, the performance and scaling attributes of Power Systems are only part of the story. The enterprise-grade resiliency and flexible capacity features that Power Systems are known for become increasingly important to clients as they deploy in-memory analytics capabilities. SAP HANA availability across the entire POWER8 product line allows our existing clients to quickly and easily extend these benefits to HANA by simply allocating additional capacity on their infrastructure.

We continue to collaborate and partner with SAP to optimize and tune in-memory database performance for Power Systems, including further leveraging of SIMD instructions, transactional memory, and other acceleration features in POWER. With the successes we’ve seen in running these challenging in-memory workloads on our enterprise-class servers, we’re off to a great start, one that clients are sure to find highly beneficial while balancing the explosion of data in their day to day business operations.

If your enterprise is considering deploying SAP HANA, have you thought about running it on Power Systems?

Another HMC Goody: myHMC Mobile

Edit: Does anyone use this?

Originally posted January 19, 2016 on AIXchange

After trying out the HMC virtual appliance (vHMC), I wanted to examine the myHMC mobile application. The app, which came out last summer, is designed to allow you to manage HMC devices from your phone.

For more, watch this video, and read  Appendix A from this IBM Redbook.

myHMC is an Android or iOS application that lets you connect to and monitor managed objects on your Power systems Hardware Management Console (HMC). Monitoring includes the status of your Managed Systems, Logical Partitions/Virtual Machines and VIO servers. The application also allows you to view Resource Groups, Serviceable Events and Performance Data.

Since I have an Android phone, I downloaded the app from Google Play. Apple users can download an iOS version from iTunes.

Once you install myHMC, there’s a built-in demo HMC inside the app for you to play with, though if you have the proper network connectivity and user ID and password information, you should be able to connect it to your own HMC.

I went ahead and connected the myHMC app on my phone to the vHMC running in VMware on my local network — although obviously I’d need to VPN in or have my mobile device connected to a corporate network in order to use it there. It’s a minimal interface, but it does provide a useful read-only view of HMC information.

You can see your managed systems, VIO servers, logical partitions and resource groups in the Resources section of the app. The errors and notifications section displays your serviceable events, and allows you to drill down for details about events and errors. The more information section provides the HMC serial number, machine type, HMC code version and build level.

A dashboard view displays the HMCs that are online, the attention LEDs, the events and the status of managed systems — including whether they’re powered on, operating, initializing or in standby mode. In the logical partitions view, the options are not-activated, running, suspended, open firmware and migrating running.

Under settings, you can find information such as how to use this app, release notes and open source licenses. How to use this app brings you to six pages of information, including screen shots that help you understand how to navigate (although it’s fairly self-explanatory once you try it out). You’re told you can switch between your HMCs and your dashboard view. Use the + key to add an HMC (you’d first need to enable remote connections and remote operation on the HMC, just as you normally would for remote access). Individual HMCs can be edited or deleted by holding the corresponding icon, while application settings are available from the overflow menu icon. To send the developers feedback about the application, simply shake your phone while the app is running.

Again, all of the information in the application is read-only. At least I didn’t see any way to modify anything on my HMC from the application. Perhaps you found something that I’ve overlooked? Be sure to let me know what you find as you use the app.

What do you think? Do you have the connectivity you need into your data center to make this application useful to you?

Testing Out the New vHMC

Edit: Do you use this in your environment?

Originally posted January 12, 2016 on AIXchange

Have you ever wished you could run HMC code on your laptop? Sure, there are unsupported work-arounds, but the new HMC virtual appliance (vHMC) makes this task much simpler to accomplish, and it’s supported by IBM. Read all about it in section 3.2 of this Redbook.

Although IBM designed the vHMC solution for use in data centers — either as a backup to an existing physical HMC or by itself as a primary HMC solution — I wanted to know if it would actually run on a laptop. In truth, I just had to know.

Before getting into the install process, a bit about the vHMC itself. It allows you to manage Power servers from your existing VMware environment. In addition, some high availability solutions can be set up around your vHMC VM in existing VMware environments. This is especially useful for smaller customers that want to manage one or two smaller Power servers (located either on-site or remotely) without the need for dedicated HMC hardware.

The FAQs and general info that follows can be found in this document. I recommend reading the entire thing.

Support for vHMC firmware, including how-to and usage, is handled by IBM software support similar to the hardware appliance. When contacting IBM support for vHMC issues specify “software support” (not hardware) and reference the vHMC product identification number (PID: 5765-HMV).

How-to, install, and configuration support for the underlying virtualization manager is not included in this offering. IBM has separate support offerings for most common hypervisors which can be purchased if desired.

Q: How can I tell if it’s a vHMC?
A: To determine if the HMC is a virtual machine image or hardware appliance, view the HMC model and type. If the machine type and model is in the format of “Vxxx-mmm,” then it is a virtual HMC.

From command line (CLI) use the lshmc -v command and check the *TM field for a model starting with “V” and/or the presence of the *UVMID fields… .
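
For example, from an ssh session to the HMC, a quick check along these lines should tell you (going from the fields named above):

    # A virtual HMC shows a *TM value starting with V and a *UVMID field
    lshmc -v | grep -E '\*TM|\*UVMID'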

Q: Are existing HMC customers entitled to vHMC?
A: No. vHMC is a separate offering and must be purchased separately. There is no conversion and no upgrade offering at this time.

Q: Are there any restrictions related to on-site warranty support for managed servers?
A: Restrictions are similar to the hardware appliance. You must supply a workstation or virtual console session located within 8 meters (25 feet) of the managed system. The workstation must have browser and command line access to the HMC. This setup allows service personnel access to the HMC. You should supply a method to transfer service related files (dumps, firmware, logs, etc) to and from the HMC and IBM service. If removable media is needed to perform a service action, you must configure the virtual media assignment through the virtualization manager or provide the media access and file transfer from another host that has network access to HMC.

Q: Can the vHMC be hosted on IBM POWER servers?
A: No, the current offering is only supported on Intel hardware. See release notes for the requirements.

Q: Is DHCP/private network supported?
A: Automatic configuration of a private DHCP network interface at install time by the activation engine is not supported. Manually configuring a private DHCP network using the HMC GUI/CLI is supported the same as with the hardware appliance. Note that a private DHCP network requires an isolated network to the managed server FSPs. Using the hypervisor to configure an isolated private network is outside the scope of vHMC. As with the hardware appliance, vHMC does not support VLAN tagged packets.

As noted in these installation instructions, the vHMC supports the kernel-based virtual machine (KVM) and VMware virtualization hypervisors. Here are the minimum requirements for running it:

  • 8 GB of memory
  • 4 processors
  • 1 network interface (maximum of 4 allowed)
  • 160 GB of disk space (recommended: 700 GB to get adequate performance and capacity monitoring (PCM) data)

Note: The processor on the systems that host the HMC virtual appliance must be either an Intel VT-x or an AMD-V hardware virtualization-enabled processor.
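
Before going any further, it’s worth confirming that the host CPU actually exposes hardware virtualization. On a Linux/KVM host, one common check is:

    # Count CPU flags for Intel VT-x (vmx) or AMD-V (svm);
    # a result of 0 means the vHMC image won't boot on this host
    grep -c -E 'vmx|svm' /proc/cpuinfo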

In my test environment I saw tolerable performance with 1 CPU and 4G of memory. Of course I wouldn’t recommend running it that way in production.

Also remember: The vHMC itself isn’t monitored. IBM has no visibility into all the different types of hardware on which this code could run:

Callhome for serviceable events with a failing MTMS of the virtual HMC appliance are not called home to IBM hardware service. The virtual HMC appliance is a software only offering with no associated hardware as provided in the HMC hardware appliance. Serviceable events reported against the vHMC appliance can be reported manually to IBM software support by phone or the IBM service web site.

Callhome for serviceable events on the managed servers and partitions, which will have “Failing MTMS” of the server, works the same on the virtual HMC as on the hardware appliance.

I had a copy of VMware workstation on my laptop, and I first tried the KVM version of the vHMC code inside of a Linux VM running KVM. First, I had to get the code. After confirming that I had entitlement, I went to the ESS website, clicked on entitled software, and then selected software downloads. When prompted for my operating system, I selected other. This brought up the 5765-HMV Power HMC Virtual Software Appliance. After selecting that, I was able to choose:

    tar.gz Download README
    TGZ, ESD – Virtual HMC V8.8.4 for VMware 11/2015
    TGZ, ESD – Virtual HMC V8.8.4 for KVM 11/2015

The actual files that were downloaded were named:

    README_for_tar_gz_Downloads_3-2007.tar.gz
    ESD_-_Virtual_HMC_V8.8.4_for_VMware_112015.tar.gz
    ESD_-_Virtual_HMC_V8.8.4_for_KVM_112015.tar.gz

After unzipping and untarring the files, I fiddled around with nested VMs to see if I could get the KVM vHMC code working in a Linux VM that was running inside of VMware:

Most hypervisors require hardware-assisted virtualization (HV). VMware products require hardware-assisted virtualization for 64-bit guests on Intel hardware. When running as a guest hypervisor, VMware products also require hardware-assisted virtualization for 64-bit guests on AMD hardware. The hardware-assisted virtualization features of the physical CPU are not typically available in a VM, because most hypervisors (from VMware or others) do not virtualize HV. However, Workstation 8, Player 4, Fusion 4, and ESXi 5.0 (or later) offer virtualized HV, so that you can run guest hypervisors which require hardware-assisted virtualization.

With virtualized HV enabled for the outer guest, you should be able to run any guest hypervisor that requires hardware-assisted virtualization. In particular, this means that you will be able to run 64-bit nested guests under VMware guest hypervisors.

Although I checked the correct box to enable nested virtualization, in my early tests the performance was too sluggish to get much done. I got vHMC to boot inside of KVM from within VMware, but it was far simpler to just run it in KVM or VMware natively.

I still wanted to get the KVM version to work, so I loaded Redhat Linux Enterprise Edition on an old standalone desktop machine. I copied the KVM file over to the Linux machine, and clicked on create a new virtual machine. The directions came right from the Redbook cited at the beginning. I selected import an existing disk image, left my OS as generic, set my memory and CPU settings, gave it a name, and clicked on finish. It came right up just as expected. Then I switched over and concentrated on my VMware instance.
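
If you prefer the command line to the virt-manager GUI, the same import can presumably be done with virt-install. This is a sketch with hypothetical names, paths and settings, not a tested recipe:

    # Import the vHMC disk image as a new KVM guest, sized per the minimums above
    virt-install --name vHMC --ram 8192 --vcpus 4 \
        --disk path=/var/lib/libvirt/images/vHMC.img \
        --import --os-variant generic --network default --graphics vnc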

To get vHMC to deploy in VMware, I clicked on one of the files (vHMC.ova) that was uncompressed from the VMware tarball I previously downloaded. I was prompted for the name of my new virtual machine and the path where it would live on the disk. I then clicked on import.

From there, everything else happened automagically. It was set to use thin provisioned disk, which only took up a little space on my machine, about 8G or so. By selecting “power on this virtual machine,” my vHMC came right up.

I did the next steps on both the KVM and VMware versions. I was first prompted to change my locale, so I told it to exit and not prompt again. I did likewise when prompted about changing my keyboard. Finally, I was asked to accept the license agreement. In short, everything worked pretty much as it would in a fresh HMC install on standalone hardware.

After accepting the license, the guided setup prompts came up. I skipped over that and the Callhome setup, since neither is necessary for my sandbox environment. At least that’s what I thought. It turns out though that the guided setup is where you create a password for your hscroot user ID. Having not done so, I couldn’t logon. So I rebooted the VM and tried again, this time running the guided setup. I chose my timezone, set up a password for hscroot and root, and skipped over the option to create other users.

For my networking I used an open network with an address of 127.0.0.1, and skipped over firewall settings. I told it I didn’t want to configure another adapter. In addition, I didn’t change the hostname or the domain name. I didn’t put in a gateway address or a gateway device. I told it I didn’t want to use DNS, and I skipped over setting up the SMTP server. Then I clicked on finish. After closing the wizard, it reset the GUI and allowed me to login as hscroot. (Each time you login you get a “tip of the day,” which is another thing I skipped.) Finally, I looked at my HMC version and indeed saw I was running 8.8.4.0, on a model type Vxxx-mmm.

My sandbox performance is pretty good, especially considering that this is an undersized VM that competes for resources with other VMs. On top of that, my test machine only has 8G of physical RAM installed. Obviously in a lab environment performance isn’t really a priority. In a real environment this code would be as snappy as you’d find on dedicated hardware.

Another nice thing is that suspend and resume function the same as they do in other VMs you might be used to on KVM or VMware. It’s a simple matter to get it out of the way to free up resources; then when you’re ready to get back to it, you pick right up where you left off.

Finally, I appreciate that the process of installing fixes seems identical to what we’re used to with standalone HMCs. Since my VMware internal switch was set up to give my vHMC an address, I obtained one using DHCP by going into my HMC network adapter settings. I changed that setting from a fixed IP to DHCP and got it on the network. Then I was able to go into updates by selecting update HMC. IBM Fix Central didn’t show any vHMC updates, but there were regular HMC updates (MH01560 and MH01588), so I tested those on my sandbox server. Everything worked fine.

There remain many sound reasons to have an isolated standalone management machine serve as your infrastructure point of control. For starters, I believe that the KISS concept still has its advantages when it comes to managing critical hardware. However, the vHMC does offer another option for managing our machines, and I’m sure that adoption will grow as users get more comfortable with it. Testing it out in your environment will be the first step.

Can you see yourself using this solution in place of a dedicated HMC? Please share your thoughts in comments, along with any requests for other tests I can run with the vHMC.

Simon Scripts

Edit: Still good stuff.

Originally posted January 5, 2016 on AIXchange

For years I’ve been asking you to send me scripts. Sharing your scripting abilities benefits us all. We can use them as is, or as a starting point to create scripts that could help others.

Sometimes I find scripts — take these, for instance (here, here and here). Regardless of their origin, I share them when I get permission from their authors. It’s a win-win.

With that in mind, here’s a script that Simon Taylor recently sent me:

Simple script to check/extend dump device. If I wanted to get fancy, I would cron it, read errpt output, and limit the size of dumpdev based on free space in rootvg. But then it wouldn’t be simple.

    #!/usr/bin/ksh
    # qdump: compare the size of the primary dump device with the estimated
    # dump size, and extend the device to fit when called with "extend".
    # Logical volume name of the primary dump device (e.g. lg_dumplv)
    primary=`sysdumpdev -l | awk '/primary/ {print $2}' | cut -d / -f3`
    echo primary is $primary
    # Physical partition size (in MB) of rootvg
    ppsz=`lsvg rootvg | awk '/SIZE:/ {print $6}'`
    # Estimated dump size converted from bytes to PPs; printf %d truncates
    # to an integer (ksh88 arithmetic can't handle floats), and the +1
    # below rounds the estimate up
    estimated=`sysdumpdev -e | awk '{printf "%d\n", $NF/1024/1024/'$ppsz'}'`
    let estimated=estimated+1
    # Current size of the dump device in LPs
    real=`lslv $primary | awk '/^LPs:/ {print $2}'`
    echo estimated size is $estimated, real is $real
    if [ $real -lt $estimated ] ; then
            let extend=estimated-real
            let extend+=1
            if [ "$1" = extend ] ; then
                    extendlv $primary $extend && echo "extended dump device"
            else
                    echo 'call with arg "extend" to extend dump device'
            fi
    fi
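
Run it with no arguments to get a report, or pass extend to actually grow the dump device (assuming you’ve saved it as qdump):

    # Report only
    ./qdump
    # Report, and extend the dump device if it's undersized
    ./qdump extend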

I asked Simon if he had other scripts he could share, and he provided several. They’re packaged in this tarball. What follows is from his README file, which is included in the tarball.

A collection of (hopefully) useful scripts organized in a couple of directories.

All the scripts that initiate communication with remote hosts assume that they run from an account that has a root ssh key on the remote host.

scripts directory

 menu

– a simple menu program written in perl in the late 90’s as an antidote
to compiled menu programs with licenses and incomprehensible menu
formats. Documentation is in the script and menu file.
Call with a menu name; otherwise the script looks for
         $(dirname $0)../menus/main.mnu

 qdump

– korn shell script to display the difference in disk blocks between
    the size of the system dump device and the output of “sysdumpdev -e”.
    Call with arg “extend” to extend the dump device to estimated size + 1

menus directory
main.mnu – a sample menu for the menu script. Will display this readme and run the scripts. Does not require the .mnu suffix.
Try scripts/menu
——————————————————————————–

doc_vio_disks directory

 doc_vio_disks – maps vio server and client disks. Reports on misconfiguration.
 Call with arg vio_server_name.
 Script will find the partner vio server, the managing hmc and the clients.

  Assumptions:
  1. User account running the command has root keys on lpars and vio servers
  2. User account running the command has hscroot keys on managing hmcs
  3. Frame has two vio servers and both serve vscsi disks to clients

 support_scripts – subfolder containing scripts used by doc_vio_disks
   chk.disks – collects disk info: name, lun, storage serial, size, vg
   display_vio_diskmap – displays output
   disp_vdisk – collects and formats client disk data
   knock.pl – general purpose script to test connection to ip/socket pair
   get_vioserver_data – collects and formats selected prtconf type data
   get_device_map – collects hmc device map

 doc_vio_disks consists of all these bits because originally it was meant to
 answer the question “What’s the next free disk on the vio servers?”. It started as a means to parse data collected manually from the vio servers and the hmc.  I normally display output in two windows side by side to make errors/problems show up.

——————————————————————————–

pmksysb
 pmksysb – script to pull a mksysb from a server using ssh and a fifo
 pmksysb_client – pushed to the client to run the mksysb

 Written to avoid the annoyance of trying to manage nfs mounts and distributed cron jobs. Can be controlled from the central server using a simple script and file containing “day of month” “local target directory” “server name”.
 The simple script invokes pmksysb on “day of month”, writes “server name” mksysb data to local “target directory”.

 This is the help displayed if pmksysb is run without arguments:
pmksysb -c client_hostname
        [ -d local_directory (default /export/mksysb) ]
        [ -f local_file (default client.mksysb) ]
        [ -v ] verify the local mksysb output
        [ -o ] overwrite existing local_file
        [ -n ] skip mkszfile on client
        [ -s ] skip mkvgdata on client
        [ -z ] gzip the completed mksysb
        [ -k kill_time (default 1 hour) ]
        [ -m mail_file ] merge mail for transmission to someone

Take a mksysb of a remote client using a named fifo.
Also runs savewpar on wpar clients.
The -k flag is meant to be used to prevent the client task from being
killed within 1 hour (done to prevent orphan processes on slow systems)

Behaviour is further modified by optional environment variables
RUNAS – local user with root ssh key on remote system (default root)
LOCAL_BIN – local location of pmksysb_client (default /usr/local/bin)
MKSYSB_DIR – local directory which will receive mksysbs (default /export/mksysb)
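
Putting the flags and environment variables together, a typical invocation might look like this (host name and paths are placeholders):

 # Pull a verified, gzipped mksysb of client lpar01 into /export/mksysb,
 # overwriting any existing image of the same name
 MKSYSB_DIR=/export/mksysb pmksysb -c lpar01 -f lpar01.mksysb -v -o -z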

——————————————————————————–

where directory
 where_is – looks for a server in the file written by hierarchy.pl and displays
            where the server is.

 where_is a_server
 a_server found on a_frame, hmc is a_hmc, vio is a_vio another_vio

 hierarchy.pl

– collects cec and lpar data from hmcs writes a list containing hmc, cec, lpar info

 Sample crontab entry for hierarchy.pl
# crontab entry – midday, because most systems will be up.
# This is why where_is fools with fuser on the data file in case it is
# still being written

0 12 * * * /some/location/hierarchy.pl /some/location/full_hmc.names

# /some/location/full_hmc.names just contains hmc names, one per line

st7392020@gmail.com
——————————————————————————–

As always, feel free to send me your scripts and I’ll happily share them.

The Important Work of Certification Test Writing

Edit: Some links no longer work.

Originally posted December 22, 2015 on AIXchange

I’ve once again been working with teams that are updating various certification tests. I enjoy the interaction with tech pros from around the world as we devise test questions and answers.

As I wrote in my previous post on this topic:

The first thing I noticed was the strict confidentiality required for all team members. We were not to discuss questions or answers with anyone outside of the team for any reason. The last thing we want to do is allow a test taker to get access to the questions and answers. If people are able to cheat their way through an exam, it lessens the value of the certification for those who pass the exam legitimately.

Detecting cheating, or “non-independent test taking” (NITT), has become an even bigger deal since the time I wrote those words:

NITT is any circumstance when an exam is not taken independent of all external influence or assistance.

Non-Independent Test Taking (NITT) is a breach of IBM Test Security and is a serious violation of IBM Professional Certification Testing Practices.

If you have taken an IBM certification exam, and it is determined you did NOT test independently: Your certification (if awarded) will be revoked; resulting in the loss of your certified status. You will be banned from testing, and will not be allowed to take any IBM test.

BEFORE TESTING

DON’T:
1. Use any unauthorized study guides, or other materials, that include the actual certification test questions.
2. Have someone else take the exam for you.

DURING TESTING

DON’T:
1. Talk to others who are testing, or look at their screen
2. Use written notes, published materials, testing aids, or unauthorized material.

AFTER TESTING

DON’T:
1. Disclose any test content.
2. Reproduce the test.
3. Take any action that would result in providing assistance or an unfair advantage to others.

Detecting Non-Independent Test Taking

IBM (and many IT certification programs) has devised methods to detect the use of resources containing IBM certification test questions. Through complex data forensics, we can identify a NITT violation. The forensic analysis is based on a variety of factors and different elements of the testing results. (IBM does not rely on any single piece of data.)

It is important that in addition to a standard review of the overall test results, there are multiple aspects of response patterns that are analyzed. The psychometrics of the test performance is evaluated. And IBM also includes independent statistical analysis in making the determination. Based on this rigorous evaluation, IBM can make the NITT decision with certainty and with an unchallengeable degree of confidence.

IBM takes notification of NITT very seriously. A notice is sent only when the conclusion is unmistakable.

By sending a violation notice, IBM has determined, undoubtedly, the testing candidate had access to the test questions (and used the questions) to prepare for the exam. The status of Pass or Fail does not matter. Also, it is not relevant whether the use of questions, from the certification exam, was intentional (or unintentional). In all cases, the fact is the test-taker had reviewed the questions from the certification exam, prior to taking the test.

This video highlights some of the reasons you might want to become certified.

When others learn that I’m involved in writing certification test questions, the typical response is to jokingly ask for copies of the questions. While I get where the humor is coming from, it really isn’t funny to me, because I understand the value of an honestly earned certification. There are ramifications for asking for or distributing questions and answers, and they exist for good reason.

Calculating Hypervisor Memory Overhead

Edit: Some links no longer work.

Originally posted December 15, 2015 on AIXchange

A customer recently contacted IBM Support, wondering how much memory the hypervisor could be expected to consume in their real world environment. Even given how inexpensive memory has become and how convenient it is to add and modify partitions as needed, customers can benefit by planning for their expected workloads as well as their hypervisor and VIO server memory overhead.

Of course this is hardly a new topic. About a decade ago, the LPAR validation tool helped customers obtain this sort of information. This interesting article from 2004 mentions hypervisor memory overhead:

Aside from the memory configured for a partition, additional memory is used for the Hypervisor, translation control entries (TCE) memory and page tables. The Hypervisor is firmware on the pSeries LPAR-capable systems, which helps ensure that partitions only use their authorized resources. When a pSeries system is running with partitions, 256 MB of memory is used by the Hypervisor for the entire system.

There’s additional overhead memory called TCE, which is used for direct memory access (DMA) for I/O devices. For every four I/O drawers on a pSeries system, 256 MB of memory is allocated for TCE memory. The TCE memory isn’t additional overhead specific to partitions. Even AIX systems without partitions use TCE memory, but it’s included in the AIX system memory.

Page tables are used to map physical memory pages to virtual memory pages. Like the TCE tables, page tables aren’t a unique overhead for LPARs. In other AIX non-partitioned systems, this overhead memory is part of the memory that AIX allocates at boot. Each partition needs 1/64th of its memory size, rounded to a power of two, for page table space in memory. The amount of page table space that’s allocated is based on the maximum memory setting in the partition’s profile.
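
To make that 1/64th rule concrete, here’s a rough back-of-the-envelope calculation. It’s only an approximation (as noted below, IBM’s exact algorithm is proprietary), and the numbers are made up:

    # page table estimate for an LPAR with a maximum memory setting of 100 GB:
    # 100 GB / 64 = 1.5625 GB, rounded up to the next power of two = 2 GB
    # the same arithmetic in ksh, working in MB:
    max_mem_mb=102400
    hpt_mb=$((max_mem_mb / 64))
    pow=1
    while [ $pow -lt $hpt_mb ]; do pow=$((pow * 2)); done
    echo "Estimated page table space: ${pow} MB"    # prints 2048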

Now here’s a recent article on hypervisor page table entries:

When setting memory values, it’s important to remember that the size of the Hypervisor page table (HPT) entries that are used to keep track of the real memory to virtual memory mappings for the LPAR is calculated based on maximum, not desired, memory. This means that common sense needs to be applied to setting maximum memory for an LPAR or Hypervisor memory overhead will be much higher than necessary.

This is the response my customer received from IBM:

The exact algorithm to calculate the amount of memory reserved for PHYP is proprietary information and I cannot send that to you. The official method for calculating that is the “IBM System Planning Tool.” You should adjust the system plan with your information.

Let’s try to do that calculation for your current configuration first:

1. On the HMC, select the server in question and from Tasks choose Configuration -> System Plans -> Create.
2. Once it’s created, select System Plans from the left pane of your HMC.
3. Select the created system plan and choose Tasks -> Export System Plan.
4. In the new window, select the “Export to this computer from the HMC” radio button and click OK.
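
As a sanity check against the system plan, you can also ask the HMC directly how much memory the hypervisor is using on a managed system. This is a minimal sketch using the HMC’s lshwres command; verify the attribute names against your HMC level, and replace <managed-system> with your server’s name:

    lshwres -m <managed-system> -r mem --level sys -F configurable_sys_mem,curr_avail_sys_mem,sys_firmware_mem
    # sys_firmware_mem reports the memory (in MB) currently used by system firmware, including the hypervisor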

Are you doing this calculation to get an idea of hypervisor overhead on your systems, or do you simply make sure the system has plenty of available memory and keep in mind that some percentage will be going to system overhead?

The Costs of Technical Debt

Edit: Still an important concept to understand.

Originally posted December 8, 2015 on AIXchange

As often as I see it, it still surprises me when I encounter a company that depends on some application, but chooses to run it on unsupported hardware without maintenance agreements and/or vendor support. If anything goes sideways, who knows how they will stay in business.

Another situation that isn’t uncommon involves time-sensitive projects: new builds where settings or changes are identified and added to a change log. It’s supposed to get taken care of in a few days, but you know the drill. Somehow the changes aren’t made, and before you know it the machine is in production. The build process is over and users are on to testing or development.

Then there are the innumerable enterprises that continue to run old hardware, old software, old operating systems or old firmware. Why is this the case? Are business owners not funding needed updates and changes? Is it a vendor issue? Sometimes vendors go out of business or discontinue support of back versions of their solutions. In smaller shops, maybe one tech cares for the system, and no one else has any idea what’s being done to keep things running. This becomes a problem if that one tech leaves. Then there’s the all-purpose excuse: “If it isn’t broke, why fix it?”

There’s actually a name for this: technical debt:

Technical debt (also known as design debt or code debt) is a recent metaphor referring to the eventual consequences of any system design, software architecture or software development within a codebase. The debt can be thought of as work that needs to be done before a particular job can be considered complete or proper. If the debt is not repaid, then it will keep on accumulating interest, making it hard to implement changes later on. Unaddressed technical debt increases software entropy.

Analogous to monetary debt, technical debt is not necessarily a bad thing, and sometimes technical debt is required to move projects forward.

As a change is started on a codebase, there is often the need to make other coordinated changes at the same time in other parts of the codebase or documentation. The other required, but uncompleted changes, are considered debt that must be paid at some point in the future. Just like financial debt, these uncompleted changes incur interest on top of interest, making it cumbersome to build a project. Although the term is used in software development primarily, it can also be applied to other professions.

It’s hardly a new term, either. Although this piece, from 2003, focuses on the process of writing software, I think it’s applicable to other areas of IT as well.

Technical Debt is a wonderful metaphor developed by Ward Cunningham to help us think about this problem. In this metaphor, doing things the quick and dirty way sets us up with a technical debt, which is similar to a financial debt. Like a financial debt, the technical debt incurs interest payments, which come in the form of the extra effort that we have to do in future development because of the quick and dirty design choice. We can choose to continue paying the interest, or we can pay down the principal by refactoring the quick and dirty design into the better design. Although it costs to pay down the principal, we gain by reduced interest payments in the future.

The metaphor also explains why it may be sensible to do the quick and dirty approach. Just as a business incurs some debt to take advantage of a market opportunity developers may incur technical debt to hit an important deadline. The all too common problem is that development organizations let their debt get out of control and spend most of their future development effort paying crippling interest payments.

The tricky thing about technical debt, of course, is that unlike money it’s impossible to measure effectively.

The same article cites this 1992 report. (Funny how as quickly as business computers evolve, some of the underlying issues of using them remain with us.)

Shipping first time code is like going into debt. A little debt speeds development so long as it is paid back promptly with a rewrite…. The danger occurs when the debt is not repaid. Every minute spent on not-quite-right code counts as interest on that debt. Entire engineering organizations can be brought to a stand-still under the debt load of an unconsolidated implementation, object-oriented or otherwise.

Here’s more from the Wikipedia link:

“It is useful to differentiate between kinds of technical debt. Fowler differentiates “Reckless” vs. “Prudent” and “Deliberate” vs. “Inadvertent” in his discussion on Technical Debt quadrant.”

There’s also this:

The concept of technical debt is central to understanding the forces that weigh upon systems, for it often explains where, how, and why a system is stressed. In cities, repairs on infrastructure are often delayed and incremental changes are made rather than bold ones. So it is again in software-intensive systems. Users suffer the consequences of capricious complexity, delayed improvements, and insufficient incremental change; the developers who evolve such systems suffer the slings and arrows of never being able to write quality code because they are always trying to catch up.

Finally, this article argues that we aren’t making the leaps and bounds in computing we once did, in part because of technical debt.

A decade ago virtual reality pioneer Jaron Lanier noted the complexity of software seems to outpace improvements in hardware, giving us the sense that we’re running in place. Our computers, he argued, have become more complex and less reliable. We can see the truth of this everywhere: Networked systems provide massive capacities but introduce great vulnerabilities. Simple programs bloat with endless features. Things get worse, not better.

Anyone who’s built a career in IT understands this technical debt. Legacy systems persist for decades. Every major operating system — desktop and mobile — has bugs so persistent they seem more like permanent features than temporary mistakes. Yet we constantly build new things on top of these increasingly rickety scaffolds. We do more, so we crash more — our response to that has been to make crashes as nearly painless as possible. The hard lockups and BSODs of a few years ago have morphed into a momentary disappearance, as if nothing of real consequence has happened.

Worse still, we seem to regard every aspect of IT with a ridiculous and undeserved sense of permanence. We don’t want to throw away our old computers while they still work. We don’t want to abandon our old programs. Some of that is pure sentimentality — after all, why keep using something that’s slow and increasingly less useful? More of it reflects the investment of time and attention spent learning a sophisticated piece of software.

What are your thoughts? Is “good enough” actually good enough, or could we be doing more?

Moving an AIX System

Edit: Some links no longer work.

Originally posted December 1, 2015 on AIXchange

If you’re tasked with migrating, duplicating or cloning your system from old to new hardware, how do you go about it?

If the system isn’t too old, and your source systems are virtualized, you may be able to perform a live partition mobility operation. That process is non-intrusive enough that your users may not even realize there’s been a migration (although hopefully they’ll notice that things are running much faster on the new hardware).

Assuming the source system isn’t virtualized or is using internal disks, perhaps a fibre card is available. That way you could allocate some new LUNs to the source system, copy your rootvg data to them, and then swing the LUNs over to your destination. This method works — provided you do it correctly.

This technote covers some things to look for in IBM’s “Supported Methods of Duplicating an AIX System”:

Question: I would like to move, duplicate or clone an AIX system onto another partition or hardware. How can I accomplish this?

Answer: This document describes the supported methods of duplicating, or cloning, an AIX instance to create new systems based on an existing one. It also describes methods known to us that are not supported and will not work.

Q: Why Duplicate A System?

A: Duplicating an installed and configured AIX system has some advantages over installing AIX from scratch, and can be a faster way to get a new LPAR or system up and running.

Using this method, customized configuration files, installation of additional AIX filesets, application configurations and tuning parameters can be set up once and then installed on another system or partition.

Supported Methods
1. Cloning a system via mksysb backup from one system and restore to new system.
2. Using the alt_disk_copy command.
3. Using alt_disk_mksysb to install a mksysb image on another disk.

Advanced Techniques
1. Live Partition Mobility
2. Higher Availability Using SAN Services

There are methods not described here, which have been documented by DeveloperWorks. Please refer to the document “AIX higher availability using SAN services” for details.

Non-Preferred Methods
There are other methods that may not produce a bootable system under some scenarios. When used in a virtual environment or according to the IBM DeveloperWorks document mentioned above, they may be used to replicate or move a rootvg. However, if used with directly attached disks (either internal or SAN-based) they may not work.

Some of these methods are:
1. Using a bitwise copy of a rootvg disk to another disk.
2. Removing the rootvg disks from one system and inserting into another.

This also applies to re-zoning SAN disks that contain the rootvg so another host can see them and attempt to boot from them.

Q: Why don’t these methods work?

A: The reason for this is that there are many objects in an AIX system that are unique to it: hardware location codes, World-Wide Port Names, partition identifiers, and Vital Product Data (VPD), to name a few. Most of these objects or identifiers are stored in the ODM and used by AIX commands.

If a disk containing the AIX rootvg in one system is copied bit-for-bit (or removed), then inserted in another system, the firmware in the second system will describe an entirely different device tree than the AIX ODM expects to find, because it is operating on different hardware. Devices that were previously seen will show missing or removed, and the system may fail to boot with LED 554 (unknown boot disk).
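
For the supported methods above, the commands themselves are short. Here’s a minimal sketch, assuming hdisk1 is a spare disk that doesn’t belong to any volume group; the mksysb image path is made up:

    # method 2: clone the running rootvg to a spare disk
    alt_disk_copy -d hdisk1

    # method 3: restore an existing mksysb image onto a spare disk
    alt_disk_mksysb -m /export/mksysb/lpar1.mksysb -d hdisk1

    # either way, check the boot list before rebooting
    bootlist -m normal -o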

Feel free to share your own migration practices in comments.

Getting Started with Spectrum Scale

Edit: Some links no longer work.

Originally posted November 24, 2015 on AIXchange

IBM recently published — and just updated — a Redbook that covers IBM Spectrum Scale (formerly GPFS).

This IBM Redbooks publication updates and complements the previous publication: Implementing the IBM General Parallel File System in a Cross Platform Environment, SG24-7844, with additional updates since the previous publication version was released with IBM General Parallel File System (GPFS). Since then, two releases have been made available up to the latest version of IBM Spectrum Scale 4.1. Topics such as what is new in Spectrum Scale, Spectrum Scale licensing updates (Express/Standard/Advanced), Spectrum Scale infrastructure support/updates, storage support (IBM and OEM), operating system and platform support, Spectrum Scale global sharing – Active File Management (AFM), and considerations for the integration of Spectrum Scale in IBM Tivoli Storage Manager (Spectrum Protect) backup solutions are discussed in this new IBM Redbooks publication.

This publication provides additional topics such as planning, usability, best practices, monitoring, problem determination, and so on. The main concept for this publication is to bring you up to date with the latest features and capabilities of IBM Spectrum Scale as the solution has become a key component of the reference architecture for clouds, analytics, mobile, social media, and much more.

If you’re looking for a shorter time investment, check out this introductory video. It provides an overview of Spectrum Scale and its benefits. It runs about 6 minutes. There’s also a 2-part video series that goes into a little more detail. These vids run about 20 minutes each.

Part one covers the concepts and technology:

This is a technical introduction to Spectrum Scale FPO for Hadoop designed for those who are already familiar with HDFS concepts. Key concepts such as GPFS NSDs, Storage Pools, Metadata, and Failure Groups are covered.

Part two shows you how to set up a simple GPFS cluster:

This is a technical introduction to Spectrum Scale FPO for Hadoop designed for those who are already familiar with HDFS concepts. In this video, I show how the concepts from Part 1 can be applied with a demo of setting up a 2-node cluster from scratch.
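
If you’d rather skim than watch, the commands the second video walks through look roughly like the sketch below. Treat it as a sketch only: the node names, stanza file and file system name are made up, and exact flags vary by Spectrum Scale release:

    # create a two-node cluster that uses ssh/scp between nodes
    mmcrcluster -N node1:quorum,node2:quorum -p node1 -s node2 -r /usr/bin/ssh -R /usr/bin/scp
    mmchlicense server --accept -N node1,node2

    # define the NSDs from a stanza file, then create, start and mount a file system
    mmcrnsd -F /tmp/nsd.stanza
    mmcrfs gpfs1 -F /tmp/nsd.stanza -T /gpfs1
    mmstartup -a
    mmmount gpfs1 -a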

I’ve actually been looking for an opening to write about this topic, because I’m seeing more customers running Spectrum Scale. If you’ve used it, please share your experiences in comments.

Replacing Disks with replacepv

Edit: Some links no longer work.

Originally posted November 17, 2015 on AIXchange

IBM developerWorks recently posted this piece about replacing a boot disk in PowerVC.

The developerWorks article mentions the replacepv command, using an example where this was run:

    replacepv hdisk0 hdisk1

I haven’t messed around with replacepv, but once I read about its capabilities, I was impressed:

The replacepv command replaces allocated physical partitions and the data they contain from the SourcePhysicalVolume to DestinationPhysicalVolume. The specified source physical volume cannot be the same as DestinationPhysicalVolume.

Note:
    The DestinationPhysicalVolume must not belong to a volume group.
    The DestinationPhysicalVolume size must be at least the size of the SourcePhysicalVolume.
    The replacepv command cannot replace a SourcePhysicalVolume with a stale logical volume unless this logical volume has a non-stale mirror.
    You cannot use the replacepv command on a snapshot volume group or a volume group that has a snapshot volume group.
    Running this command on a physical volume that has an active firmware assisted dump logical volume temporarily changes the dump device to /dev/sysdumpnull. After the migration of the logical volume is successful, this command calls the sysdumpdev -P command to set the firmware assisted dump logical volume to the original logical volume.
    The VG corresponding to the SourcePhysicalVolume is examined to determine if a PV type restriction exists. If a restriction exists, the DestinationPhysicalVolume is examined to ensure that it meets the restriction. If it does not meet the PV type restriction, the command will fail.

The allocation of the new physical partitions follows the policies defined for the logical volumes that contain the physical partitions being replaced.

-f Forces to replace a SourcePhysicalVolume with the specified DestinationPhysicalVolume unless the DestinationPhysicalVolume is part of another volume group in the Device Configuration Database or a volume group that is active.
-R dir_name Recovers replacepv if it is interrupted by <ctrl-c>, a system crash, or a loss of quorum. When using the -R flag, you must specify the directory name given during the initial run of replacepv. This flag also allows you to change the DestinationPhysicalVolume.

Some of you may be wondering what took me so long to get on board with replacepv. This functionality has been around a while now (see here and here). Maybe I heard about it, and forgot. I have done the same type of thing using migratepv or running mirrorvg (though the latter requires the extra step of breaking the mirror by removing logical volumes from the disk I wanted to remove).
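
For comparison, here’s what the mirrorvg route looks like, extra steps and all. This is a sketch, assuming rootvg lives on hdisk0 and hdisk1 is the replacement disk:

    extendvg rootvg hdisk1      # add the new disk to rootvg
    mirrorvg rootvg hdisk1      # mirror the logical volumes onto it
    bosboot -ad /dev/hdisk1     # create a boot image on the new disk
    bootlist -m normal hdisk1   # point the boot list at the new disk
    unmirrorvg rootvg hdisk0    # break the mirror off the old disk
    reducevg rootvg hdisk0      # remove the old disk from rootvg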

Going forward though, I’ll be sure to add this to my bag of tricks. I would encourage anyone else who hasn’t used replacepv to do the same.

A Different View of Virtualization

Edit: Still worth considering, and 96G is still pretty small.

Originally posted November 10, 2015 on AIXchange

This article examines the issues VMware and x86 customers face as they try to virtualize their environments:

Server virtualization has brought cost savings in the form of a reduced footprint and higher physical server efficiency along with the reduction of power consumption.

Obviously, we in the Power systems world can take this statement to heart. By reducing our physical server count and consolidating workloads, we can save on power and cooling and all of the other physical things we need for our systems (including network ports, SAN ports, cables, etc.).

A non-technical driver may be the workload’s size. If an application requires the equivalent amount of compute resources as your largest VM host, it would be cost prohibitive to virtualize the application. For instance, a large database server consumes 96 GB of RAM, and your largest physical VM host has 96 GB of RAM. The advantages of virtualization may not outweigh the cost of adding a hypervisor to the overhead of the workload.

One last non-technical barrier is political issues surrounding mission-critical apps. Even in today’s climate, there’s a perception by some that mission-critical applications require bare-metal hardware deployments.

I found this interesting since 96 GB of memory isn’t a lot on today’s Power servers. In addition, with the scaling in both memory and CPU, we can assign some very large workloads to our servers. Though the need to assign physical adapters exclusively to an LPAR is far less than it once was, we still have the option to use the VIO server for some workloads and physical adapters for others. Alternatively, we can use virtual for network and physical for SAN, or vice versa. With this flexibility, we can mix and match things as needed and make changes dynamically. It’s another advantage to running workloads on Power:

It would be unrealistic to think the abstraction that enables the benefits of virtualization doesn’t come at a cost. The hypervisor adds a layer of latency to each CPU and I/O transaction. The more intense the application’s performance requirements, the greater the impact of that latency.

Since Power Systems are always virtualized, the hypervisor is always running on the system. The chips and the hypervisor are designed for virtualization. The same company designs the hardware, virtualization layer and the operating system. Everything works hand in hand. Even a single LPAR running on a Power frame runs the same hypervisor under the covers. We simply don’t see the kinds of performance penalties that VMware users do:

However, these direct access optimizations come at a cost. Enabling DirectPath I/O for Networking for a virtual machine disables advanced vSphere features such as vMotion. VMware is working on technologies that will enable direct hardware access without sacrificing features.

The same argument around Live Partition Mobility (LPM) could be made for Power systems that have been built with dedicated adapters. The nice thing is that on the fly we can change from physical adapters to virtualized adapters, run an LPM operation to move our workload to another physical frame, and then add physical adapters back into the LPAR. The flexibility we get with dynamic logical partitioning (DLPAR) operations allows us to add and remove memory, CPU, and physical and virtual adapters from our running machine.

As a quick aside, I expect to see even more blurring of the ways we virtualize our adapters as we continue to adopt SR-IOV:

SR-IOV allows multiple logical partitions (LPARs) to share a PCIe adapter with little or no run time involvement of a hypervisor or other virtualization intermediary. SR-IOV does not replace the existing virtualization capabilities that are offered as part of the IBM PowerVM offerings. Rather, SR-IOV complements them with additional capabilities.

Getting back to the article on VMware and x86 customers, I was surprised by the conclusion. Most of my Power customers are able to virtualize a very high percentage of their workloads:

Complex workloads can challenge the desire to reach 100% virtualization within a data center. While VMware has closed the gap for the most demanding workloads, it may still prove impractical to virtualize some workloads.

Have you found the overhead associated with hypervisors a hindrance to virtualizing your most demanding workloads?

I’d like to pose these questions to you, my readers. How much of your workloads are virtualized? Do you even consider hypervisors or overhead when you think about deploying your workloads on Power?

A List of System Scanning Tools

Edit: Some links no longer work.

Originally posted November 3, 2015 on AIXchange

What kinds of tools do you use to document and check your systems? I’ve written about prtconf, a built-in tool, and hmcscanner, but many other solutions are available.

Here are three software tools that readers have shared with me. I’m not endorsing any of them; my hope is that by listing a few solutions in one place, it will help you conveniently research new options for your own environments.

systemscanaix:

SystemScan AIX can help by identifying problems, mistakes, and omissions made during the build phase, helping you to improve the security, performance, and serviceability of your systems.

(It) consists of a single RPM that can be installed on AIX 5.3, 6.1, or 7.1. It also has separate modules for HMC/IVM, and VIOS, that can be run from cron and silently produce system configuration reports that can then be transferred to another server for analysis.

For details, see the sample report and FAQs.

aixhealthcheck:

AIX Health Check is software that scans your AIX system for issues. It’s like an automated AIX check list. Download it from our website, unpack and run it on your AIX server and receive a full report in minutes. You decide the format: Text, HTML, CSV or XML output. Have the report emailed to you if you like. AIX Health Check is designed to help you pro-actively detect configuration abnormalities or other issues that may keep your AIX system from performing optimally.

See the sample reports and FAQs for more.

cfg2html, a free tool:

Cfg2html is a UNIX shell script similar to supportconfig, getsysinfo or get_config, except that it creates a HTML (and plain ASCII) system documentation for HP-UX 10.xx/11.xx, Integrity Virtual Machine, SCO-UNIX, AIX, Sun OS and Linux systems. Plug-ins for SAP, Oracle, Informix, Serviceguard, Fiber Channel/SAN, TIP/ix, OpenText (IXOS/LEA), SAN Mass Storage like MAS, EMC, EVA, XPs, Network Node Manager and HP DataProtector etc. are included. The first versions of cfg2html were written for HP-UX. Meanwhile the cfg2html HP-UX stream was ported to all major *NIX platforms, LINUX and small embedded systems.

Some consider it to be the Swiss army knife for the Account Support Engineer, Customer Engineer, System Admin, Solution Architect etc. Originally developed to plan a system update, it was also found useful to perform basic troubleshooting or performance analysis. The production of nice HTML and plain ASCII documentation is part of its utility.

Go here for additional information.

Feel free to use the comments to mention other tools and options.

HMC installios Cleanup

Edit: Some links no longer work. Some updates at the bottom.

Originally posted October 27, 2015 on AIXchange

A while back, I was called in to assist an IBM i heritage customer that encountered difficulty installing a VIO server from their HMC.

Fortunately, this support document had some helpful information:

This document describes how to cleanup HMC installios after a failure or interruption of the command.

The HMC installios process failed or was interrupted before completing, and a subsequent installios command fails with a permission error, such as “/tmp/installios.lock : print Operation not permitted.”

1. If a problem occurred during the installation and installios did not automatically unconfigure itself, run the following command to manually unconfigure the installios installation resources.

    installios -u

Sometimes the command may fail with a “Permission Denied” error or an error similar to the one below. If it does, proceed with the remaining procedure.

    hscroot@hostname:~> installios -u
    nimol_config MESSAGE: Unconfiguring the NIMOL server…
    nimol_config ERROR: The file /etc/nimol.conf does not exist.
    nimol_config MESSAGE: Running undo…
    ERROR unconfiguring nimol_config.

2. Check if any of the following exist. If so, they need to be removed:

    /tmp/installing.lock
    /tmp/installios_cfg.lock
    /tmp/installios_pid

To remove the file(s), you must obtain a “temporary” PESH access code to gain root access by contacting an HMC Software Support Representative at 1-800-IBM SERV. You will need the HMC serial number. …

Once you have root access to the HMC, change the file(s) permissions by running:

    chmod 775 /tmp/<filename>

At this point, you can try ‘installios -u’ again or manually remove the file(s). Then try the installation again.

HMC 7.3.4 has a known issue with lpar_netboot command creating log files in /tmp such that later execution will cause a log file collision resulting in a failure due to permission error. The fix is in HMC 7.3.5 with (mandatory fix) PTF MH01197. For more details, please, contact an HMC Software Support Representative.

In our case, cleaning up from the installation was as simple as running installios -u and then retrying the operation. Sure enough, on the retry, it again hung partway through the install. I guessed that this was the point where the previous attempt had been aborted.

On the HMC I was able to look at the log file:

    /var/log/nimol.log

I found that the install got this far:

    2015-08-13T06:14:35.088694-05:00 ioserver nimol: ,info=initialization
    2015-08-13T06:14:36.037522-05:00 ioserver nimol: ,info=verifying_data_files
    2015-08-13T06:14:41.084288-05:00 ioserver nimol: ,info=prompting_for_data_at_console

The LPAR was hung at LED 0c48.

I was able to open a console to the LPAR and then select the LUN that the VIO server would be installed to. In this case the LUN was being reused, and the installer recognized that a rootvg was already there. Rather than simply auto-overwrite the LUN, we received a warning prompt. It was making sure we actually wanted to overwrite it. I found this behavior pretty slick.

In general, I prefer NIM for installing VIOS, but in this case the alternative was the best choice, given the overall expertise of the people doing the installation. For an IBM i team with no knowledge of AIX or the NIM server, NIM would have been too much trouble.

——-

EDIT: This was where the original post ended. I got an email from an old co-worker from my days at IBM, Vic Walter. He gave me permission to share our conversation.

Hey Rob,

                I hope all is well with you. I am having issues with VIO installs via HMC image failing and ran across your article.

                When I run the cmd…

                                installios -F -e -R default1

                I get an error message….

                                ERROR removing default1 label in nimol_config.

                And am not finding anything about where nimol_config is

                Maybe you can help ?  thx

——-

I replied with:

Were you able to get the PESH password from IBM?  Seems like they would be able to help?  I guess I would run a find command to see if I could find the file..

——-

He replied with:

I do have a case open with IBM and did get the pesh passwords, even running the installios cmd as root also fails.

Find as root did find it.  /usr/sbin/nimol_config is a script, but has no default1 reference in it.

[hmc1 /] # grep default /usr/sbin/nimol_config

                               \rdefaults:

                               \r\t-L    default

                msg "No NIMOL server hostname specified, using %s as the default.\n" "$NIMOL_SERVER"

# Specify the defaults if variables aren’t set.

[[ -z ${LABEL} ]] && LABEL="default"

——-

After some back and forth, he sent me an update from IBM

——-

Hi Rob,

                Sorry for the delay in responding. IBM’s solution was to shell into the HMC as hscpe (with the pw they provided), su - root, and run these cmds.

Once you login as root, first perform cleanup of the previous installios attempts with the below commands:
installios -F -e -R default1
installios -u
check for the below lock files and remove them if exist:
ls -la /tmp/installing.lock
ls -la /tmp/installios_cfg.lock
ls -la /tmp/installios_pid

                In my case the 3 files were present and I removed them. After that the HMC was rebooted by one of the AIX admins before I could get back to the VIO install.

When I did get back to the VIO installs all went well.

One other issue is the NIC used in the failing VIO install was not able to network boot off of the HMC for some reason. I borrowed the NIC from the other VIO to complete the install and this is when the failures appeared. This failure of the network boot could have been the original cause of the VIO install fail and incomplete cleanup. I am not real sure here. This is a new frame, but it also was not seeing one of the internal NVMe disks. One slot had “unknown” instead of the usual 800 GB NVMe description. I had the IBM CE reseat things and run diag on the box. He did find the drive not seated properly and otherwise found no issues.

——-

The main reason I wanted to document this was so that in the future, if this post comes up in your search, there will be another option for you to try.

An Underutilized PowerHA Option

Edit: Some links no longer work.

Originally posted October 20, 2015 on AIXchange

A while back, IBM’s Chris Gibson offered a PowerHA tip that you might have missed:

You can use the SEA poll_uplink method (requires VIOS 2.2.3.4). In this case the SEA can pass up the link status; no “!REQD”-style ping is required anymore.

Yes, you can install VIOS 2.2.3.50 on top of 2.2.3.4.

At the moment I’m not aware of any official documentation regarding how to configure SEA poll_uplink in a PowerHA environment. I was in touch with Dino Quintero (editor of the PowerHA Redbooks) and his team will update the latest PowerHA Redbook with this information soon.

However, it’s very easy to enable SEA poll_uplink in PowerHA. Configuration steps:

* Enable poll_uplink on ent0 interface (run this command for all virtual interfaces on all nodes):
    # chdev -l ent0 -a poll_uplink=yes -P
* This change requires a reboot.
* Check ent0 and the uplink status:
    # lsattr -El ent0 | grep poll_uplink
    poll_uplink yes Enable Uplink Polling True
    poll_uplink_int 1000 Time interval for Uplink Polling True
    # entstat -d ent0 | grep -i bridge
    Bridge Status: Up

* Enable poll uplink in CAA / PowerHA:
     # clmgr -f modify cluster MONITOR_INTERFACES=enable
* Run cluster verification and synchronization.
* Finally, start PowerHA cluster.

In response, another IBMer, Shawn Bodily, tweeted that he’d updated the PowerHA wiki with this information.

That prompted Chris to post this information:

I wanted to mention a new AIX feature, available with AIX 7.1 TL3 (and 6.1 TL9) called the “AIX Virtual Ethernet Link Status” capability. Previous implementations of Virtual Ethernet do not have the ability to detect loss of network connectivity.

For example, if the VIOS SEA is unavailable and VIO clients are unable to communicate with external systems on the network, the Virtual Ethernet adapter would always remain “connected” to the network via the Hypervisors virtual switch. However, in reality, the VIO client was cut off from the external network.

This could lead to a few undesirable problems, such as, a) needing to provide an IP address to ping for Etherchannel (or NIB) configurations to force a failover during a network incident, lacking the ability to auto fail-back afterwards, b) unable to determine total device failure in the VIOS and c) PowerHA fail-over capability was somewhat reduced as it was unable to monitor the external network “reach-ability.”

The AIX VEA Link Status feature provides a way to overcome the previous limitations. The new VEA device will periodically poll the VIOS/SEA using L2 packets (LLDP format). The VIOS will respond with its physical device link status. If the VIOS is down, the VIO client times out and sets the uplink status to down.

To enable this new feature you’ll need your VIO clients to run either AIX 7.1 TL3 or AIX 6.1 TL9. Your VIOS will need to be running v2.2.3.0 at a minimum (recommend 2.2.3.1). There’s no special configuration required on the VIOS/SEA to support this feature. On the VIO client, you’ll find two new device attributes that you can configure/tune. These attributes are:

    poll_uplink (yes, no)
    poll_uplink_int (100ms – 5000ms)

Here’s some output from the lsattr and chdev commands on my test AIX 7.1 TL3 partition that show these new attributes.

   # oslevel -s
   7100-03-01-1341
   # lsattr -El ent0 | grep poll
   poll_uplink     no    Enable Uplink Polling              True
   poll_uplink_int 1000  Time interval for Uplink Polling   True
   # lsattr -El ent0 -a poll_uplink
   poll_uplink no Enable Uplink Polling True
   # lsattr -Rl ent0 -a poll_uplink
   no
   yes
   # lsattr -Rl ent0 -a poll_uplink_int
   100…5000 (+10)

Although Chris first mentioned this in March and brought it up again this summer, I’m not sure many of you are aware of this option. Even some PowerHA guys I reached out to didn’t know about it, so this information seems well worth sharing.

The Simplest Script

Edit: Send me your scripts.

Originally posted October 13, 2015 on AIXchange

I was recently working with someone who had built some new LPARs. As part of the build out he decided his NIM server would make a good general purpose utility server. This NIM server would become a one-stop shop where he planned to stage fixes along with the base OS images he’d use to create his environment.

During the build out, he needed to get console access to servers so he could, for example, configure networking. That meant logging into the HMC and then running vtmenu. However, this extra step of logging into the HMC was taking too long.

He set it up so that he could ssh with keys from the NIM server to all of the LPARs in the environment, including the VIO servers and the HMCs. This became his central point of control. He could get anywhere by just logging into the NIM server first. (Obviously it then becomes critical to lock down NIM server access to prevent individual users from freely roaming this environment, but this can be accomplished easily enough.)

These articles (here, here and here) note that vtmenu works fine for getting a console; in fact, it’s my preferred method of gaining console access. But why go to the hassle of logging into your HMC if you can just do it from your utility server?

Always interested in saving extra steps, my colleague went ahead and set up a simple script on his utility LPAR. Let me emphasize the word “simple” — this script is just a single line in /usr/local/bin:

    ssh -t hscroot@<hmc-ip-address> "vtmenu"

This works because he can log into the HMC without a password using his ssh keys. It brings him directly to the list of managed servers that you’d expect. From there, he can pick the frame and LPAR he wants to see. (Note: Of course, <hmc-ip-address> would need to be replaced with your actual HMC IP address for use in your environment.)

One way this could be further automated is to borrow from the script that vtmenu itself runs on the HMC (/usr/hmcrbin/vtmenu) and call the underlying commands directly. For example:

    lssyscfg -r sys -F name

    lssyscfg -m <machine name> -r lpar -F name

These commands would let your own script do what vtmenu does:

    mkvterm -m <machine name> -p <partition_name>
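
To put those pieces together, a slightly longer wrapper might look like the sketch below. It assumes ssh keys to the HMC are in place and that no frame or partition names contain spaces; as before, <hmc-ip-address> is a placeholder:

    #!/usr/bin/ksh
    HMC="hscroot@<hmc-ip-address>"

    # pick a managed system, then a partition on it
    select frame in $(ssh $HMC lssyscfg -r sys -F name); do break; done
    select lpar in $(ssh $HMC lssyscfg -m $frame -r lpar -F name); do break; done

    # open a console session on the chosen partition
    ssh -t $HMC mkvterm -m $frame -p $lpar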

While further modifications weren’t needed in this case, I’d still like to see something that behaves this way. So if you’re willing to share your own time-saving scripts, I’d love to take a look. You may not consider your scripts to be suited for anything other than what you’re doing, but that’s not necessarily the case. We can all learn from one another.

IBM Announcements Including AIX 7.2 and New Linux Servers

Edit: Some links no longer work.

Originally posted October 5, 2015 on AIXchange

What version of AIX are you running? At conferences and events, when presenters ask for a show of hands, I’m seeing fewer and fewer shops running AIX 5.3, AIX 5.2 or older versions of AIX. They are migrating to newer hardware and newer versions of PowerVM and the AIX operating system.

I have been to some training lately that reiterated a good point: when you run AIX on IBM Power servers, you get your products and support from one vendor. Instead of worrying about compatibility among your server, your hypervisor and your operating system, or about hypervisor overhead introducing performance penalties, you run an integrated stack from the hardware through the firmware into the operating system. You take advantage of IBM’s mainframe heritage and of virtualization options that are designed into the hardware instead of bolted on in software. The massive memory bandwidth and threads per core that are available with POWER8, and the latest operating system versions that exploit the hardware, are unmatched by the competition.

In my training I also heard some differentiators for AIX and Power compared to other operating systems and environments. AIX usually runs key workloads, while other operating systems are used for less-critical applications. AIX and POWER8 offer better performance and better scaling than other platforms. And with PowerVM, organizations have many opportunities for server and workload consolidation, with the ability to tightly “pack” these servers and run at high average utilizations.

Until now, our options for running AIX on POWER8 were AIX 5.2 and AIX 5.3 in versioned WPARs, AIX 6.1 and AIX 7.1.

IBM’s latest announcement brings us to AIX version 7.2, which will provide Live Update for Interim Fixes, Server Based Flash Caching, 40G RoCE for Oracle RAC performance, and vNIC adapters for use with SR-IOV adapters, complete with quality of service (QoS) settings. vNIC will also help us use SR-IOV adapters with Live Partition Mobility, which was one of the drawbacks of SR-IOV before, and it will be more efficient than using shared Ethernet adapters (SEA) in our VIO servers. vNIC will also work with AIX 7.1 TL4, so you do not necessarily need to upgrade to AIX 7.2 to take advantage of it.

AIX 7.2 still comes in two varieties: AIX Standard Edition (which includes Dynamic System Optimizer) and AIX Enterprise Edition. Dynamic System Optimizer will be included in the base OS to help us with system tuning, especially on the larger multi-CEC systems. AIX Enterprise Edition consists of everything that you find in AIX Standard Edition, but it also includes other IBM software you can use to manage your environment, including:

  • PowerVC
  • PowerSC Standard Edition
  • AIX Dynamic System Optimizer
  • Cloud Manager with OpenStack for Power V4.3
  • IBM Tivoli Monitoring V8.1
  • IBM BigFix Lifecycle V9.2

One interesting feature that is coming along is the ability to live update service packs and technology levels. The current AIX hotpatch technology (available since AIX 6.1) is great for certain isolated ifixes, but is not extensible to service packs or technology levels. AIX 7.2 Live Update is a new approach that initially supports only ifixes, but is designed to be extensible to service packs and technology levels in the future.
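
Based on IBM’s early AIX 7.2 documentation, invoking Live Update for an ifix looks roughly like the sketch below. The ifix file name is made up, and the operation depends on a /var/adm/ras/liveupdate/lvupdate.data file that describes the resources (disks, and so on) the temporary surrogate partition can use:

    # preview the live update first, then run it for real
    geninstall -k -p -d /tmp/ifixes IV12345s5a.151201.epkg.Z
    geninstall -k -d /tmp/ifixes IV12345s5a.151201.epkg.Z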

We will be able to use the Coherent Accelerator Processor Interface (CAPI) with AIX 7.2; up to now it has been Linux-only. I expect to see more hardware taking advantage of CAPI in the future. By using CAPI, we can reduce the number of instructions needed to do I/O: instead of talking to an interface and going through its drivers, we go directly from the CPU to a flash storage array, for example.

There will be two different features related to SSDs: LVM Mirroring to Flash and Server Based Flash Caching. These are the key distinctions:

  • LVM Mirroring to Flash uses the existing LVM mirroring capability to mirror slower spinning storage to high-speed SSD storage, and then lets us specify the SSD as the preferred mirror for reads. This implies that the SSD must have the same capacity as the spinning storage. It is implemented on both 7.1 (already available in TL3 SP4) and 7.2.
  • Server Based Flash Caching is the ability to use a smaller SSD as a cache for larger and slower spinning storage. It does not rely on LVM mirroring (the storage does not have to be mirrored). Unless you need a full mirror, this is a more cost-effective solution than mirroring, since it does not require as much SSD capacity, yet it provides a similar performance benefit. This is an AIX 7.2-only feature (at least for now).

Other Announcements

Also in these announcements, we find that the newest release of PowerKVM, v3.1, will run in little endian mode. There will be vCPU and memory hot plug support, dynamic micro threading, and SR-IOV support.

On the hardware side new Linux only machines were announced:

  • S822LC for High Performance Computing is a 2-socket 2U system with two NVIDIA GPUs.
  • S822LC for Commercial Computing is a 2-socket 2U system with no GPU. It will have up to 20 cores and 1 TB memory with five PCIe slots, four of which are CAPI enabled.
  • S812LC is a 1-socket 2U system with up to 14 large form factor disk drives, which provides for 84 TB of on-board storage. This machine supports up to 10 cores and 1 TB memory with four PCIe slots, two of which are CAPI enabled.

The LC system portfolio will be different from other scale-out servers. Customers will have access to pricing and configurations and will purchase directly from the Web, although they are still welcome to engage with a business partner to help them with their machines. IBM states that it is simple to order these systems. They come with a three-year 8×5 warranty with 100 percent client replaceable parts. Six configurations are available. These systems should be available Oct. 30.

As always, IBM is committed to bringing new hardware and operating system features to its customers, and this announcement is no exception.

For more on these announcements, check out:

Jay Kruemcke’s blog “AIX 7.2 and October Power software announcements”

Recently updated IBM AIX – From Strength to Strength document

Announcement letter: IBM AIX 7.2 delivers the reliability, availability, performance, and security needed to be successful in the new global economy

A list of all of today’s announcements

Displaying Virtual Optical Device Info with lsvopt

Edit: Some links no longer work.

Originally posted September 29, 2015 on AIXchange

I have a client that works with virtual optical devices, having built one for each LPAR on its system. The client wanted to know the easiest way to display these devices along with all of the virtual media (both the media already loaded into the devices, and the media available to load).

I’ve covered this topic before (see here and here), but it’s worth revisiting.

The client created the virtual optical devices using this command:

    mkvdev -fbo -vadapter vhostX -dev somename

Media from the DVD drive was copied using this command:

    mkvopt -name cdname.iso -dev cd0 -ro

Media was verified with the lsrep command, which displays the size of the client’s virtual media repository, along with the names, sizes and access of all the .iso images (either ro or rw). (Note: I recommend monitoring the size of your own media repository, particularly if you plan on adding more media.) While similar information can be found with ls -la /var/vio/VMLibrary, lsrep seems a bit more user-friendly.

On this project, I worked directly with a guy with an IBM i background and limited familiarity with the VIO server. In my experience, people accustomed to IBM i tend to look for an HMC GUI method of manipulating the VIO server, or some other easier alternative to messing around with UNIX command-line stuff. In this instance, he was trying to avoid a couple of common uses of the lsmap command:

* lsmap -vadapter vhostX — This would require him to specify the vadapter parameter and go through the adapters one by one.

* lsmap -all | more — He didn’t want to have to scroll through all of the resulting output.

Fortunately, the lsvopt command provided the alternative to all that pain. With lsvopt, he could inventory the virtual media devices, displaying the name of each device, the media that was loaded, and the size of the loaded media.
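
As a quick illustration of the day-to-day loop with these devices, here’s a minimal sketch; the image and device names are made up:

    loadopt -disk cdname.iso -vtd vtopt0     # load an image from the media repository
    lsvopt                                   # list each device and any loaded media
    unloadopt -vtd vtopt0                    # unload the media when you're finished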

Since I mentioned it, note that lsvopt is also handy when it comes to VIO server upgrades. See this section of the release notes:

Before installing the VIOS Update Release 2.2.3.50
The update could fail if there is a loaded media repository.

Checking for a loaded media repository
To check for a loaded media repository, and then unload it, follow these steps:
    1. To check for loaded images, run the following command: $ lsvopt
       The Media column lists any loaded media.
    2. To unload media images, run the following command on all Virtual Target Devices that have loaded images: $ unloadopt -vtd <file-backed_virtual_optical_device>
    3. To verify that all media are unloaded, run the following command again: $ lsvopt
       The command output should show No Media for all VTDs.

While a lot of you rely on lsmap, I still run into people who don’t know about lsvopt. Plus, a refresher never hurts.

The IBM Champion Program is Back

Edit: Now I am a Lifetime Champion. Some links no longer work.

Originally posted September 22, 2015 on AIXchange

Back in 2011 I wrote about the IBM Champion program and how happy I was to be one of those recognized. Since that time, the program went on a bit of a hiatus, and there hadn’t been any new nominations for Power Champions (my official designation). Occasionally, someone on Twitter or elsewhere online would ask about the program and when it would be revived.

I’m pleased to report that that time is now:

It’s IBM Champion Season! (nominations are open)

No, that doesn’t mean you get to hunt IBM Champions! What it means is that nominations are now open, so you can nominate IBM Champions for the following areas:

    IBM Social Business (AKA Lotus, ICS, ESS)
    IBM Power Systems
    IBM Middleware (AKA Tivoli, Rational, WebSphere)

When: From September 14 – October 31

How: https://ibm.biz/NominateChamps

The IBM Champion program recognizes innovative thought leaders in the technical community. An IBM Champion is an IT professional, business leader, or educator who influences and mentors others to help them make the best use of IBM software, solutions, and services, shares knowledge and expertise, and helps nurture and grow the community. The program recognizes participants’ contributions over the past year in a variety of ways, including conference discounts, VIP access, and logo merchandise, exclusive communities and feedback opportunities, and recognition and promotion via IBM’s social channels.

Contributions can come in a variety of forms, and popular contributions include blogging, speaking at conferences or events, moderating forums, leading user groups, and authoring books or magazines. Educators can also become IBM Champions; for example, academic faculty may become IBM Champions by including IBM products and technologies in course curricula and encouraging students to build skills and expertise in these areas.

Take the opportunity to nominate an influencer of IBM Social Business, IBM Power, or IBM Middleware, now. Nominations for the 2016 IBM Champion program will be accepted through Midnight Eastern Time, October 31st 2015.

Nominations for IBM Champion are open to candidates worldwide, and candidates can be self-nominated or nominated by another individual. IBM employees are not eligible for nomination.

Tips for a solid nomination:
* Be specific about contributions. They need to be verifiable by either a web search, or by someone at IBM who can confirm the contributions.
* It is not a popularity contest – more nominations does not necessarily boost your chances. Content of the nomination is vital.
* Include links to the nominee’s blog, if applicable for contributions.
* Include the nominee’s twitter handle, if they have one.
* Include the nominee’s email address.
* Stick to contributions for 2015. Nothing prior to that is relevant as contributions are assessed each year.

I’m excited to see the program is back, and I look forward to seeing who will soon be joining the ranks of IBM Power Champions. Who do you plan to nominate?

Sending Log Files to IBM

Edit: Still worth thinking about.

Originally posted September 15, 2015 on AIXchange

Are you sending in log files and snap files to IBM for problem analysis? I usually send my information via FTP, but lately I’ve tried other methods like HTTPS or the Java utility. For anyone who’s grown up with GUIs, these may be more appealing options.

Learn more about updating PMRs by using the Enhanced Customer Data Repository:

Enhanced Customer Data Repository (ECuRep) is a secure and fully supported data repository with problem determination tools and functions. It updates problem management records (PMR) and maintains full data life cycle management.

This video provides further information:

What follows can be found via the send data tab:

Speed of transfer
While you may send data to any of our addresses, your speed of transfer will be quickest if you choose the geographic location nearest your physical location.

Americas
The Java and z/OS utilities are fastest
The next fastest methods are FTP and FTPS
Server address: testcase.boulder.ibm.com

Asia Pacific
The Java and z/OS utilities are fastest
The next fastest methods are FTP and FTPS
Server address: ftp.ap.ecurep.ibm.com

Europe
The Java and z/OS utilities are fastest
The next fastest methods are FTP, FTPS and SFTP
Server address: ftp.ecurep.ibm.com

Use this chart to determine which method best suits your needs based on the size of the files you’re transferring. I’ve listed the information below, but believe me, it will make more sense when you consult the chart.

Available methods, by file size:

FTP
    Greater than 2 GB: Yes, both regular and secure FTP methods are supported. (Faster)
    Less than 2 GB: Yes, both regular and secure FTP methods are supported. (Faster)
    Less than 20 MB: Yes, both regular and secure FTP methods are supported.

HTTPS
    Greater than 2 GB: Only when using the widget on www.secure.ecurep.ibm.com.
    Less than 2 GB: Yes, both regular and secure HTTP methods are supported, but a file limit of 200 megabytes is strongly encouraged when transmitting data via HTTPS.
    Less than 20 MB: Yes, both regular and secure HTTPS methods are supported.

Java utility
    Greater than 2 GB: Yes, all data is transmitted securely using the Java utility. (Faster)
    Less than 2 GB: Yes, all data is transmitted securely using the Java utility. (Faster)
    Less than 20 MB: Yes, all data is transmitted securely using the Java utility.

Email
    Greater than 2 GB: No.
    Less than 2 GB: No.
    Less than 20 MB: Yes, both regular and secure emails are supported.
1. Gather diagnostic data. Your IBM SSR will inform you what diagnostic data is required, and will provide you with a Problem Management Record (PMR) number. Write this down.
2. Compress the data. All diagnostic data delivered electronically to IBM must be in a compressed or packed format following the IBM file naming conventions.

A problem record is identified by its ID, which is built out of the PMR <xxxxx> or RCMS/CROSS number <xxxxxxx>, the branch office <bbb> (only mandatory for PMR ticket IDs), and the country code <ccc>.

File naming convention for PMR tickets: xxxxx.bbb.ccc.yyy.yyy
Example: 34123.055.724.Filename.zip (<PMR id>.<branch_office>.<country_code>.<filename>)

For further assistance, contact: contact@ecurep.ibm.com.
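
Putting the naming convention together, here’s a rough sketch of preparing a snap file for the example PMR above. It’s only a sketch: the upload directory on the FTP server is from memory, so confirm the details with your support representative before sending anything:

    snap -ac                 # gather everything and compress it to /tmp/ibmsupt/snap.pax.Z
    cd /tmp/ibmsupt
    mv snap.pax.Z 34123.055.724.snap.pax.Z
    # then ftp the renamed file to testcase.boulder.ibm.com (the /toibm/aix directory)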

So for those of you who send in log files, have you been sticking with FTP and the command line, or have you tried another method?

A Troubleshooting Follow-up

Edit: More fun with zsnap.

Originally posted September 8, 2015 on AIXchange

Last week I wrote about the zsnap command and how it can be used to collect information and troubleshoot data for AIX, PowerHA or VIO server. Here’s how to use zsnap with PowerHA SystemMirror:

The following procedures are for data collection, not for problem diagnosis. Gathering this information before calling IBM support can aid in problem determination and save time resolving Problem Management Records (PMRs).

Using zsnap for PowerHA SystemMirror
Run # zsnap --HACMP
This zsnap command gathers PowerHA data and creates the testcase file in one step. If you already have a PMR number, see the example below.

Data
The zsnap command for PowerHA SystemMirror gathers the same information as snap at this time. The data include:

* Data from both nodes
* CAA data (PowerHA 7.1 and up)
* RSCT information (PowerHA 6.1 and lower)
* AIX information: bootinfo, lslpp, emgr, lsdev disk data, lspv, lsvg, lsfs, mount, df, lscfg, lsattr on fibre channel adapters, process table, env data
* Network information: netstat -in, netstat -rn, netstat -v, netstat -m, lsdev adapter and interface data, tty, lsattr on network adapters
* ODM data for both PowerHA and AIX
* Error report
* Configuration files: clhosts, clinfo.rc, harc.net, netmon.cf, rhosts, clip_config, environment, inetd.conf, limits, profile, resolv.conf, snmpd.log, snmpdv3.log, filesystems, inittab, netprobe.log, rc.net, services, snmpd.peers, syslog.conf, clvg_config, hosts, ipHarvest.log, netsvc.conf, rc.nfs, snmpd.conf, snmpdv3.conf, ifrestrict
* AHAFS data
* PowerHA logs: autoclstrcfgmonitor.out, autoverify.log, cell temp log, clverify, clavan.log, cluster.log, clcomd.log, clcomddiag.log, clconfigassist.log, hacmp.out, clstrmgr.debug, clstrmgr.debug.long, clevents, clevmgrdevents, clinfo.log, clutils.log, clver_CA_daemon_invoke_client.log, clver_debug.log, cspoc.log, dhcpsa.log, dnssa.log, domino_server.log, emuhacmp.out, hacmprd_run_rcovcmd.debug, application monitor logs, smart assistant logs, smit.log, migration.log
* PowerHA data: hostname information, cllsif information, cluster state data, cluster daemon data, resource group information, cluster topology information

Example
See zsnap usage for all available options.
# zsnap --HACMP --pmr 12345,123,123
The example gathers the appropriate data and creates a testcase file with the IBM standard naming convention for quicker processing. You will be prompted to send the file to IBM using the FTP protocol. If you don’t have a PMR number, omit the --pmr flag to build the testcase file.
You can also run the zsnap command from the AIX SMIT menus.

Using snap for PowerHA SystemMirror
The snap command is the standard AIX tool that gathers data and stores that information in /tmp/ibmsupt/. The snap command does not gather all of the additional PowerHA-related information on its own.

Data
See zsnap Data section above for the data collected by the snap command.

Sample snap procedure for PowerHA
See snap usage for all available options.

Follow these steps to gather the PowerHA data.
1. Run the snap -r command to remove all previously gathered data on all of the nodes in the cluster.
2. Gather the additional information and put it in /tmp/ibmsupt/testcase. You may need to recreate the testcase directory.
3. Run # snap -e on just one node.
4. Rename the testcase file to adhere to IBM testcase file naming conventions, and then send the file to IBM.
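As a rough sketch, the sequence might look like this on the collecting node (the testcase file name follows the example PMR used earlier in this post):

    snap -r                           # step 1: clear previously gathered data (all nodes)
    mkdir -p /tmp/ibmsupt/testcase    # step 2: recreate the directory and add
                                      #         any manually gathered data
    snap -e                           # step 3: run on just one node
    # step 4: rename the resulting testcase file to the IBM convention,
    #         e.g. 34123.055.724.snap.pax.Z, and send it to IBM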

Although IBM Support will guide you through the process of collecting and sending data, it’s best to be proactive. You’ll generally resolve the issue more quickly if you do your own troubleshooting.

The First Step in Troubleshooting

Edit: Do you use snap or zsnap more often?

Originally posted September 1, 2015 on AIXchange

If you work on AIX (which you surely do if you’re reading this) and you’ve worked with IBM Support, you’ve probably used the snap command.

But are you familiar with the zsnap command?

The zsnap command is a supplemental tool used by AIX support personnel to gather debugging data. Built around the standard AIX snap command, the zsnap command gathers additional information that the snap command does not provide. You can also use the zsnap command to send a testcase directly to IBM from the machine that generated the testcase data. If needed, the zsnap command can fork multiple calls to the snap command, which results in quicker data gathering than if done via snap.

IBM has a web page that walks you through the troubleshooting process and also demonstrates the many uses of the zsnap command. This page brings you to the index:

    MustGather index
    Cluster Aware AIX problems [CAA]
    Filesystems
    JAVA on AIX
    Installation problems
    Logical Volume Manager problems
    NFS specific problems
    NIM problems
    PowerHA (HACMP) problems
    PowerVM Virtual I/O Server problems
    SAN or device I/O problems
    System crash
    TCP/IP problems

Each entry links to different procedures for gathering information. The MustGather index is a nice place to start if you’re unsure which zsnap options you should use, but all the links display different methods for using zsnap to collect information.

For example, the first entry, covering CAA issues, states:

Using zsnap for CAA
Run # zsnap --CAA
This zsnap command gathers CAA data and creates the testcase file in one step. If you already have a PMR number, see the example below.

Data
In addition to the information gathered by the snap command, the zsnap command gathers CAA data that include:

    bootstrap repository information
    detailed repository disk data
    CAA tunables data
    lscluster -i, -c, -s, -d, -m
    uname system information
    swinfo information
    CAA syslog log
Example
See zsnap usage for all available options.
# zsnap --CAA --pmr 12345,123,123

The example gathers the appropriate data and creates a testcase file with the IBM standard naming convention for quicker processing. You will be prompted to send the file to IBM using the FTP protocol. If you don’t have a PMR number, omit the --pmr flag to build the testcase file.
You can also run the zsnap command from the AIX SMIT menus.

Using snap for CAA
The snap command is the standard AIX tool that gathers data and stores that information in /tmp/ibmsupt/. There are two flags that can be used to gather CAA data with snap: snap caa or snap -e.

Data
To reduce the possibility of needing to request additional information later, the following information needs to be gathered manually and included in the snap testcase file.
See zsnap Data section above for the information you need to collect.

Sample snap procedure for CAA
See snap usage for all available options.
Follow these steps to gather the CAA data.
1. Run the snap -r command to remove all previously gathered data.
2. Gather the additional information and put it in /tmp/ibmsupt/testcase. You may need to recreate the testcase directory.
3. Run # snap caa or snap -e
4. Rename the testcase file to adhere to IBM testcase file naming conventions, and then send the file to IBM.

Here are some specific zsnap commands you can use:

    For filesystems: zsnap --FS
    For installation issues: zsnap --INSTALL, or zsnap --NIM
    For LVM issues: zsnap --LVM

Each link gives you the data that is captured and examples for using the command. For completeness there is also:

    zsnap --SAN
    zsnap --NFS
    zsnap --DUMP
    zsnap --TCPIP

As you can see, zsnap is a valuable tool that can help you before you take your problem to IBM Support.

Helpful Links About Event Monitoring

Edit: Still an interesting concept.

Originally posted August 25, 2015 on AIXchange

On Twitter, Chris Gibson linked to this interesting post from Andrey Klyachin:

A colleague asked me if there is an interface in AIX like inotify in Linux. He had a problem on one of his AIX boxes and wanted to monitor new files in a directory. Of course there is such an interface, available since AIX 6.1 TL6 and AIX 7.1 – it is AHAFS. It’s a not very well known AIX feature, used primarily by the new PowerHA 7.1, but not by admins.

If you want to know more about the feature, I would suggest you first read the IBM documentation. My example is just a small practical example of how to use the technology, not a manual.

The IBM documentation to which Andrey refers brings you to the Introduction to the AIX Event Infrastructure:

The AIX Event Infrastructure is an event monitoring framework for monitoring predefined and user-defined events.

In the AIX Event Infrastructure, an event is defined as any change of a state or a value that can be detected by the kernel or a kernel extension at the time the change occurs. The events that can be monitored are represented as files in a pseudo file system. Some advantages of the AIX Event infrastructure are:

  • There is no need for constant polling. Users monitoring the events are notified when those events occur.
  • Detailed information about an event (such as stack trace and user and process information) is provided to the user monitoring the event.
  • Existing file system interfaces are used so that there is no need for a new application programming interface (API).
  • Control is handed to the AIX Event Infrastructure at the exact time the event occurs.

Further in the documentation, we come to the infrastructure components:

The AIX Event Infrastructure is made up of the following four components:

  • The kernel extension implementing the pseudo file system.
  • The event consumers that consume the events.
  • The event producers that produce events.
  • The kernel component that serves as an interface between the kernel extension and the event producers.

From there, the doc covers setting up the Event infrastructure (which is basically installing bos.ahafs, creating the directory, and mounting it).

The high level view of how the AIX Event Infrastructure works says:

A consumer may monitor multiple events, and multiple consumers may monitor the same event. Each consumer may monitor value-based events with a different threshold value. To handle this, the AIX® Event Infrastructure kernel extension keeps a list of each consumer’s information including:

  • Specified wait type (WAIT_IN_READ or WAIT_IN_SELECT)
  • Level of information requested
  • Threshold(s) for which to monitor (if monitoring a threshold value event)
  • A buffer used to hold information about event occurrences.

Event information is stored per-process so that different processes monitoring the same event do not alter the event data. When a consumer process reads from a monitor file, it will only read its own copy of the event data.

Finally, the monitoring events section offers subsections on creating the monitor file, writing to the monitor file, reading event data, and more.

Also relevant is this typical workflow from the documentation.

Now back to Andrey’s post. He’s written a perl script that notifies him when, for instance, someone changes the /root/smit.log file:

The procedure to create a new monitor is relatively simple. We have to create a new directory and make a new .mon file in that directory. In the file we write how much information we need and some other flags. After that we read from the file, and a notification comes.

Let’s say we want to monitor the file /root/smit.log and obtain a notification every time it is changed. We go to the directory /aha/fs/modFile.monFactory – it is the standard directory for the “File modification monitor” – and create a directory named root there with the mkdir command. Then we create a smit.log.mon file in that directory and write CHANGED=YES;INFO_LVL=1 into it. That’s it! After that the only thing we have to do is wait until some information comes.
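Here’s a rough ksh sketch of that procedure, assuming bos.ahafs is installed and the /aha pseudo file system is mounted. One subtlety worth verifying against the IBM docs: AHAFS expects the write and the blocking read to happen on the same open file descriptor, which is why this sketch holds the monitor file open:

    mkdir /aha/fs/modFile.monFactory/root
    cd /aha/fs/modFile.monFactory/root
    exec 3<> smit.log.mon                # open the monitor file read/write
    print -u3 "CHANGED=YES;INFO_LVL=1"   # register: notify on change, info level 1
    cat <&3                              # blocks until /root/smit.log changes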

And to think I found all this from a single tweet.

Check Out IBM Software System Maps

Edit: I still use these all the time

Originally posted August 18, 2015 on AIXchange

Say your site is getting new hardware. One thing you’d want to know is the software versions you should be running on your shiny new boxes.

That’s what makes IBM’s Software System Maps web page worth bookmarking. Here you’ll find software maps for AIX, IBM i, PowerVM VIO servers, SUSE Linux and RedHat Linux. There’s also a link for supported code combinations for HMC and server firmware.

When you select the AIX map, you’re brought to a list of Power systems. Pick a model and a machine type, and you’ll have a choice of configurations, whether you’re looking at virtual clients or clients that have access to physical I/O cards. For instance, when I selected the 8284-22A (S822) and all I/O, I found out which AIX versions were supported and at what levels. (AIX 7100-01-10, 7100-02-05 and 7100-03-03 are the supported base levels, but 7100-01-10, 7100-02-06 and 7100-03-04 are recommended. The same information is provided for AIX 6.1; however, AIX 5.3, AIX 5.2, AIX 5.1, AIX 4.3.3 and AIX 3.2.5 aren’t supported on this hardware.) The bottom of this page contains links to the fix level recommendation tool, Fix Central, end of support dates for AIX, etc.

I should stress that you don’t need brand new hardware to make use of this tool. The AIX map supports older systems, including RS/6000s running AIX 4.3.3, AIX 5.1, etc. So if you’re still using those systems (perhaps you’re running some sort of technology museum?), you too can benefit from this capability.

And, as I mentioned, you can also do VIO server software mapping. Just select a system and find the VIO server versions that are verified, and the versions that are recommended. You can also see the versions that haven’t been verified to run on the hardware you’re interested in.

Clicking on the supported code combinations link brings you to the POWER code matrix page:

System Firmware is delivered as a Release Level or a Service Pack. Release Levels support the general availability (GA) of new function or features, and new machine types or models. Upgrading to a higher Release Level is disruptive to customer operations. IBM intends to introduce no more than two new Release Levels per year. These Release Levels will be supported by Service Packs. Service Packs are intended to contain only firmware fixes and not to introduce new function. A Service Pack is an update to an existing Release Level.

Note: Installing a Release Level is also referred to as upgrading your firmware. Installing a Service Pack is referred to as updating your firmware. For HMC-managed systems at or beyond System Firmware Release Level 230 (available May 2005), Service Pack updates can be concurrently installed. Concurrent installation minimizes or eliminates downtime needed to apply firmware patches. IBM cannot guarantee that all Service Packs can be installed concurrently, however, our goal is to provide non-disruptive installation of Service Packs.

Browse around the site. It’s kept up to date and has good reference material.

Creating Adapters with the HMC Enhanced GUI

Edit: Sometimes I miss the old interface.

Originally posted August 11, 2015 on AIXchange

I was recently playing around with the enhanced HMC GUI, using the new interface to look at an old test machine.

The test box had crash and burn LPARs that had been created over time. In some cases, I’d spin up a test LPAR and select the VIO server option that allowed for any client partition to connect to the virtual adapter I’d created. This was to allow greater flexibility going forward — it wouldn’t be necessary to re-create the adapter; I’d just assign another client LPAR to the existing one. If I hadn’t yet built the client LPAR definitions, I’d set up a bunch of server adapters ahead of time for later use with the crash and burn client LPARs.

On the new HMC software version, when I selected the manage PowerVM option, some of the disks and adapters weren’t appearing in the PowerVM Virtual storage adapter view. Since I could see them using the classic HMC view, I figured it was a bug and opened a ticket with IBM Support.

After some back and forth, support sent me this interesting information:

Server Adapters in HMC can be created with the option of “Any” for the Client Adapter. Such adapters are not supported by REST or by the Enhanced+ GUI. This is by design. It is not possible to know which client adapter it is connected to. The Server adapter mapping could possibly change during the reboot of the logical partition. The REST and Enhanced+ GUI do not provide the option of creating a Server adapter with the “Any” option. The usage of “Any” is not recommended when creating Server Adapters, though it’s possible in the Classic GUI.

That’s right. Adapters set to “Any” won’t display in the enhanced GUI option.

This explanation made sense once I thought about it, but since it took me a while to get this answer, hopefully I can save you some time and trouble by passing it along here. Then again, hopefully you aren’t creating server adapters without assigning them to clients in the first place, which would save you from ever having to deal with this issue at all. Going forward I know I’ll be more careful when assigning virtual adapters on my test machines.

The pdump Script

Edit: Do you know about this now?

Originally posted August 4, 2015 on AIXchange

Do you have a hung process on your AIX machine? Do you need more information about a running process? These are just two instances where the pdump script could help you:

The pdump script extracts information from the running process by using the kdb command and other AIX tools. This script can be especially helpful if you suspect the process is in a hung state or is suspected to be in an infinite loop.

The pdump.sh data gathering process includes:

  • svmon
  • proctree
  • thread information
  • user area information
  • lock information
  • current stack information

In order to use the script
Step 1. Determine what process is hung.

If you suspect a process is hung, first find its <pid#> using the ps command. Then, using proctree <pid#>, check if that process has child processes that it might be waiting on. If the parent process is waiting on a child process, then you should first try running pdump.sh on the last child found in the proctree.

Step 2. Run pdump on the process.
pdump.sh <pid#>
Where <pid#> is the process id that is suspected to be hung or looping.

If you cannot determine which specific process is hung, you may simply run pdump.sh against PID 1 (the init process) as a start point for investigation:
    pdump.sh 1

Tips
It is often helpful to run two pdumps on the same process at 60 second intervals. This will allow IBM AIX support center representatives to verify if that process made any progress in that time frame. Capture this information and include it in the test case you upload to IBM.
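In practice that tip looks like this (the PID is hypothetical):

    pdump.sh 12345    # first snapshot of the suspect process
    sleep 60
    pdump.sh 12345    # second snapshot, so support can check for progress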

Try running pdump with only the -l flag (long mode) unless instructed by your support representative to do otherwise. The -d flag (call dbx) might fail to attach to the process when it is hung in kernel mode.

You can copy the script from here. Change the permissions to 700 before running it for the first time.

Have you tried this tool? Were you even aware of it?

Identifying SAN Devices

Edit: Still good stuff.

Originally posted July 28, 2015 on AIXchange

Anthony English recently tweeted about world wide port names (WWPNs), linking to this series of slides last updated by Anthony Vandewerdt in 2013. When zoning storage devices and servers on a SAN, it’s important to identify every piece of hardware. For those who work with IBM storage devices, determining the WWPN ranges used by each storage model is much simpler, thanks to the IBM Storage WWPN Determination guide. Vandewerdt’s two-year-old slides are version 6.6 of the guide. When version 6.5 came out in 2012, he posted this explanation:

If this guide is new to you, its purpose is to let you take a WWPN and decode it so you can work out not only which type of storage that WWPN came from, but the actual port on that storage. People doing implementation services, problem determination, storage zoning and day-to-day configuration maintenance will get a lot of use out of this document. If you think there is an area that could be improved or products you would like added, please let me know.

It is also important to point out that IBM Storage uses persistent WWPN, which means if a host adapter in an IBM Storage device has to be replaced, it will always present the same WWPNs as the old adapter. This means no changes to zoning are needed after a hardware failure.

The document starts by defining WWPNs and world wide node names (WWNNs). It then lists the WWNN/WWPN ranges used by IBM products:

A WWNN is a World Wide Node Name; used to uniquely identify a device in a Storage Area Network (SAN). Each IBM Storage device has its own unique WWNN. For DS8000, each Storage Facility Image (SFI) has a unique WWNN. For SVC and Storwize V7000, each Node has a unique WWNN. A WWPN is a World Wide Port Name; a unique identifier for each Fibre Channel port presented to a Storage Area Network (SAN). Each port on an IBM Storage Device has a unique and persistent WWPN.
 
     – IBM System Storage devices use persistent WWPN. This means if an HA (Host Adapter) in an IBM System Storage Device gets replaced, the new HA will present the same WWPN as the old HA. IBM Storage uses a methodology whereby each WWPN is a child of the WWNN. This means that if you know the WWPN of a port, you can easily match it to the WWNN of the storage device that owns that port.
 
     – A WWPN is always 16 hexadecimal characters long. This is actually 8 bytes. Three of these bytes are used for the vendor ID. The position of the vendor ID within the WWPN varies based on the format ID of the WWPN. To determine more information we actually use the first character of the WWPN to see which format it is…

Vandewerdt also links to this list of companies that are registered with IEEE.

Share Your Product Ideas with IBM

Edit: Some links no longer work.

Originally posted July 21, 2015 on AIXchange

What new features and capabilities would you like to see added to AIX? How can you share your ideas with IBM?

In the past, customers could submit a design change request (DCR). This is now done with a request for enhancement (RFE).

Read more about RFEs here:

The following products are now available on the IBM RFE Community. This RFE Community update gives you the ability to enter additional Requests for Enhancements (RFEs), allowing for better communication between you and developers on more platforms and servers.

  • IBM AIX: The AIX operating system is an open standards-based, UNIX operating system that allows you to run the applications you want on Power Systems servers.
  • PowerHA: PowerHA SystemMirror for AIX technology is a high availability clustering solution for data center and multisite resiliency. It is designed to protect business applications from outages of virtually any kind, helping ensure round-the-clock business operations.
  • PowerSC: IBM PowerSC provides a security and compliance solution optimized for virtualized environments on Power Systems servers running the AIX operating system.
  • PowerVM VIOS: PowerVM provides a secure and scalable server virtualization environment for AIX and Linux applications built upon the advanced RAS features and leading performance of the Power Systems platform.

For details, check out these RFE FAQs and this list of status values and definitions:

The status of a request depends on:

  • Where the request is in our development lifecycle
  • Whether we are still considering the request
  • Whether we have approved it and plan to deliver it
  • Whether we have declined it.

Finally, here are a couple of videos. This roughly 8-minute video tells you how to watch for and receive RFE notifications. This longer video (it’s about 20 minutes) tells you how to submit, view and send out notifications on RFEs.

If you have an idea for enhancing AIX or any IBM product, or if you just want to discover what other users have suggested, why not engage in the process?

On a personal note, July 16 was the 8-year anniversary of AIXchange.

Over the years I’ve enjoyed hearing from the many readers who’ve told me that this feature has been educational or otherwise beneficial. Some of these readers have become good friends.

Through eight years, I’ve written in the neighborhood of 400 blog posts. Occasionally I’ll Google a term and be directed to something I wrote some years back, something I’d forgotten about. Besides jogging my memory, this often serves as reference material for topics I’m currently working on.

Although technology changes, I find that there’s still a wide audience for AIX- and Linux-oriented information, and I plan to continue to provide this into the future.

As always, if you have topics you would like to see covered, just drop me a line.

HMC Connectivity Security

Edit: Link still works.

Originally posted July 14, 2015 on AIXchange

This white paper, published in April, examines HMC 830 connectivity security:

This document describes data that is exchanged between the Hardware Management Console (HMC) and the IBM Service Delivery Center (SDC). In addition it also covers the methods and protocols for this exchange. This includes the configuration of “Call Home” (Electronic Service Agent) on the HMC for automatic hardware error reporting. All the functionality that is described herein refers to Power Systems HMC version V6.1.0 and later as well as the HMC used for the IBM Storage System DS8000.

The document covers HMC connectivity methods, with the caveat that “starting in 2015, new products will no longer have outbound VPN connectivity capabilities.”

Before the HMC tries to connect to the IBM servers, it first establishes an encrypted VPN tunnel between the HMC and the IBM VPN server gateway. The HMC initiates this tunnel using Encapsulated Security Payload (ESP, Protocol 50) and User Datagram Protocol (UDP).  After it is established, all further communications are handled through TCP sockets, which always originate from the HMC.

For the HMC to communicate successfully, the client’s external firewall must allow traffic for protocol ESP and port 500 UDP to flow freely in both directions. The use of SNAT and masquerading rules to mask the HMC’s source IP address are both acceptable, but port 4500 UDP must be open in both directions instead of protocol ESP. The firewall may also limit the specific IP addresses to which the HMC can connect.
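To picture that on a Linux-based firewall, here’s an illustrative sketch. The iptables syntax is my assumption, not from the white paper, and in practice you’d restrict these rules to your HMC’s address and the IBM server addresses listed in the document:

    # Allow IPsec ESP and IKE (UDP 500) for the HMC's VPN tunnel
    iptables -A FORWARD -p esp -j ACCEPT
    iptables -A FORWARD -p udp --dport 500 -j ACCEPT
    # If SNAT/masquerading hides the HMC's address, open NAT-T
    # (UDP 4500) in place of protocol ESP
    iptables -A FORWARD -p udp --dport 4500 -j ACCEPT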

Although modem connectivity is still supported for some systems, its use is being deprecated and the support has been removed from POWER8. IBM recommends the usage of internet connectivity for faster service, due to the size of error data files that may be sent to IBM Support. …

Configuring the Electronic Service Agent tool on your HMC enables outbound communications to IBM Support only. Electronic Service Agent is secure, and does not allow inbound connectivity. However, the HMC can be configured for customer-controlled inbound communications. Inbound connectivity configurations allow an IBM Service Representative to connect from IBM directly to your HMC or the systems that the HMC manages. The following sections describe two different approaches to remote service. Both approaches allow only a one-time use after enabling.

Reasons for connecting to IBM
* Reporting a problem with the HMC or one of the systems it is managing back to IBM
* Downloading fixes for systems the HMC manages (Power HMC only)
* Reporting inventory and system configuration information back to IBM
* Sending extended error data for analysis by IBM
* Closing out a problem that was previously open
* Reporting heartbeat and status of monitored systems
* Sending performance and utilization data for system I/O, network, memory, and processors (Power HMC only)
* Transmission of live partition mobility (LPM) data (Power HMC only)
* Track maintenance statistics (Power HMC)
* Transmission of deconfigured resources (Power HMC only).

In addition, there’s a list of the data that is sent to IBM, including filenames and the information they contain:

When Electronic Service Agent on the HMC opens up a problem report for itself, or one the systems that it manages, that report is called home to IBM. All the information in that report gets stored for up to 60 days after the closure of the problem. Problem data that is associated with that problem report is also called home and stored. That information and any other associated packages will be stored for up to three days and then deleted automatically. Support Engineers who are actively working on a problem may offload the data for debugging purposes and then delete it when finished. Hardware inventory reports and other various performance and utilization data may be stored for many years.

There are also sections that cover multiple HMCs and the IP addresses and ports that IBM uses for connectivity.

As always I recommend that you take the time to read the whole document.

A Tool for SAN Troubleshooting

Edit: Still good stuff.

Originally posted July 7, 2015 on AIXchange

Are you looking for more information about your SAN? Do you want to learn about the LUNs that have been presented to your host? Maybe you want to be able to compare what your machine sees now as opposed to what it was seeing on the SAN.

IBM has a SAN troubleshooting tool that can help you. It’s called devscan:

The purpose of devscan is to make debugging storage problems faster and easier. Devscan does this by rapidly gathering a great deal of information about the Storage Area Network (SAN) and displaying it in an easy to understand manner. Devscan can be run from any AIX host, including VIO clients, or from a VIOS.

The information devscan displays is gathered from the SAN itself or the device driver, not from ODM, with exceptions described in the man page. The data is therefore guaranteed to be current and correct.

In the default case, devscan is unable to change any state on the SAN or on the host, making it safe to run even in production environments. In all cases, devscan is safer to run than cfgmgr, because it cannot change the ODM. Some of the optional commands devscan can use are able to cause a state change on the SAN. Details are provided in the man page.

Devscan can report a list of all available target devices and LUNs. For each LUN, devscan can report:
· ODM name and status
· PVID, if there is one
· Device type
· Capacity and block size
· SCSI status
· Reservation status, both SCSI-2 and SCSI-3
· ALUA status
· Time to service a SCSI Read

Devscan scans a set of SCSI adapters, and then issues a set of commands to a set of targets and LUNs on those adapters. In the default case, devscan finds every Fibre Channel, SAS, iSCSI, and VSCSI adapter in the system and traverses each one. It issues SCSI Report LUNs and Inquiry commands to every target and LUN it finds. The set of adapters to be scanned, targets and LUNs to be traversed, and commands to be issued may be controlled with several of the optional flags.

Usage examples
1. To run against all SCSI adapters with the default command set (Start, Report LUNs, and Inquiry):
    devscan
2. To run against only the fscsi3 adapter and gather SCSI Status from all attached devices:
    devscan -c7 --dev=fscsi3
3. To determine what the NPIV client using WWPN C0507601A673002A can see through all Fibre Channel adapters on the VIOS (e.g., because the client cannot boot):
    devscan -t f -n C0507601A673002A
4. To run devscan in machine-parseable mode using "::" as the field delimiter:
    devscan --concise --delim="::"
5. To run devscan against only the VSCSI adapters in the system and write the output to /tmp/vscsi_scan_results:
    devscan -tv -o /tmp/vscsi_scan_results
6. To scan only the storage port 5001738000330193:
    echo "f|||5001738000330193" | devscan --whitelist=-
7. To scan only the storage at SCSI ID 0x010400:
    echo "f|010400" | devscan --whitelist=-
8. To scan only for hdisk15:
    echo "hdisk15" | devscan --whitelist=-
9. To scan for all targets except the one with WWNN 5001738000330000:
    echo "f||||5001738000330000" | devscan --blacklist=-
10. To scan for an iSCSI target at 192.168.3.147:
    echo "192.168.3.147" | devscan --iscsitargets=-
11. To check the SCSI status of hdisk71 on all the Fibre adapters in the system and send the output to /tmp/devscan.out:
    echo "hdisk71" | devscan --whitelist=- -o /tmp/devscan.out -tf -c7 -F

The website also includes sample outputs and explains how to interpret them:

1. Processing FC device:
    Adapter driver: fcs4
    Protocol driver: fscsi4
    Connection type: none
    Local SCSI ID: 0x000000
    Device ID: df1000fe
    Microcode level: 271102

The connection type of “none” indicates this adapter has never had a link.
2. Processing FC device:
    Adapter driver: fcs0
    Protocol driver: fscsi0
    Connection type: fabric
    Link State: down
    Current link speed: 4 Gbps
    Local SCSI ID: 0x180600
    Device ID: 77102224
    Microcode level: 0125040024

The link state of “down” indicates this adapter had a link up since the last time it was configured, but does not currently.
3. Nameserver query succeeded, but indicated no targets are available on the SAN. This means the adapter’s link to the switch is good, but no storage is available, typically because the storage has unexpectedly left the SAN or because it was not zoned to this host port.

4. Processing iSCSI device:
    Protocol driver: iscsi0

    No targets found
    Elapsed time this adapter: 0.001358 seconds

For non-Fibre Channel devices, there is no name server, so the no-targets condition looks like this.

5. 00000000001f7d00 0000000000000000
    START failed with errno ECONNREFUSED

Devscan is able to reach this device, so the host is connected to the SAN and the nameserver is reporting it, but we are not able to log in to the device. This is an end device problem.

6. Vendor ID: IBM Device ID: 2107900 Rev: 5.90 NACA: yes
PDQ: Not connected PDT: Unknown or no device
Dynamic Tracking Enabled
TUR SCSI status: Check Condition (sense key: ABORTED_COMMAND; ASCQ: LOGICAL UNIT NOT SUPPORTED)
ALUA-capable device
Report LUNs failed with errno ENXIO
Extended Inquiry failed with errno ETIMEDOUT
Test Unit Ready failed with errno EIO

Other usage examples can be found on the website. Download devscan and follow these installation instructions:

1. Download the package to your machine.
2. Uncompress and extract the archive. The binary and man page are placed in /usr/local/bin and /usr/share/man/man1/, respectively, and are ready for use.
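On AIX that might look something like this; the archive name is a placeholder for whatever the download is actually called:

    uncompress devscan.tar.Z    # placeholder file name
    tar -xvf devscan.tar        # places the binary and man page as described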

Here’s some of the output that I saw on a test machine:

    Running on host: vio1

    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    Processing FC device:
        Adapter driver: fcs0
        Protocol driver: fscsi0
        Connection type: fabric
        Link State: up
        Local SCSI ID: 0x010000
        Local WWPN: 0x10000090fa535192
        Local WWNN: 0x20000090fa535192
        Device ID: 0xdf1000e21410f103
        Microcode level: 00010000020025200009

    SCSI ID LUN ID           WWPN             WWNN
    ———————————————————–
    0a0600  0000000000000000 500507680230e835 500507680200e835
        Vendor ID: IBM          Device ID: 2145     Rev: 0000 NACA: yes
        PDQ: Connected          PDT: Block (Disc)
        Name:          hdisk14  Path:            0  VG:       None found
        Device already SCIOLSTARTed    Dynamic Tracking Enabled
        Status: Enabled
        ALUA-capable device

    0a0600  0001000000000000 500507680230e835 500507680200e835
        Vendor ID: IBM          Device ID: 2145     Rev: 0000 NACA: yes
        PDQ: Connected          PDT: Block (Disc)
        Name:          hdisk15  Path:            0  VG:       None found
        Device already SCIOLSTARTed    Dynamic Tracking Enabled
        Status: Enabled
        ALUA-capable device

    0a0600  0002000000000000 500507680230e835 500507680200e835
        Vendor ID: IBM          Device ID: 2145     Rev: 0000 NACA: yes
        PDQ: Connected          PDT: Block (Disc)
        Name:          hdisk16  Path:            0  VG:    caavg_private
        Device already SCIOLSTARTed    Dynamic Tracking Enabled
        Status: Enabled
        ALUA-capable device

    2 targets found, reporting 20 LUNs,
    20 of which responded to SCIOLSTART.
    Elapsed time this adapter: 00.391183 seconds

Did you know this tool existed? Have you used it? What did you think?

A Message Worth Repeating

Edit: Some links no longer work.

Originally posted June 30, 2015 on AIXchange

I needed a bigger vehicle. Given my work with Boy Scouts, I spend considerable time on the road, hauling boys and their camping gear.

I wanted something big enough to comfortably contain eight or nine people and capably tow a trailer of supplies. I also wanted something that was reliable, well-maintained and a good value. I knew it’d take some time. Decently priced used vehicles like that don’t become available every day, and they typically sell within hours of being advertised. But after a few months of searching craigslist, I found a Chevy Suburban that fit my needs and requirements.

Many people are understandably apprehensive about car-buying. How do you know you’re getting what’s being advertised? There are options, services like Carfax that allow you to see different aspects of a used vehicle’s history. A few individual owners may also keep documentation of the work done to their vehicles. Even if you’re not a car person, you can have a trusted mechanic examine a used vehicle for you. While all of this is reassuring, you must still be alert for individual sellers or dealerships that may try to pawn off a problem vehicle by concealing costly issues. I certainly have no desire to own a high-mileage, poorly maintained used car.

The same principle applies to our computer hardware. This is something I’ve discussed often over the years. Of course I’m hardly the only one. Anthony English explores this issue in this recent posting. He even uses a car maintenance analogy in the second paragraph. His point? We must maintain our hardware and software.

Only a few weeks ago I wrote about the need to keep current on firmware and OS patches. Customers shouldn’t skip these updates or miss out on other enhancements that are made available on a regular basis.

In short, it pays to be proactive. Plus, it’s fairly simple now. By keeping up with AIX patches, as newer generations of POWER hardware come out, much of the enablement is already loaded into your operating system. Upgrading to new hardware can be as easy as performing a live partition mobility operation to migrate to your new equipment.

Related resources: David Tansley discusses log file maintenance in this article. I wrote about the value of keeping IBM hardware maintenance contracts on your gear here.

I’m sure we can all agree that, like our cars, our machines require regular hardware and software service.

By the way, so far so good with my Suburban. I took care of some minor issues after I got it, but I’ve already had it on a few campouts and it’s been great. You can bet that I plan to continue maintaining it to protect my investment, drive the boys back and forth to campouts safely, and enjoy it for years to come. Likewise, our machines are worth the same effort and investment.

A Look at HMC 8.2 (and Beyond)

Edit: Some links no longer work.

Originally posted June 23, 2015 on AIXchange

When you upgrade your HMC to Version 8.2, there’s a new “tech preview” option that’s intended to give you a feel for the direction that the HMC interface is heading. One of the big complaints heard from those new to POWER hardware and VIOS is how complicated it can all be to learn. I’m seeing a real effort being made to add greater functionality to the GUI so non-legacy POWER users can more easily adopt the platform. 

Two webinars have been held on this topic. Here is the Nigel Griffiths presentation (check out the slides and watch the replay); and here’s one from IBM developerWorks (slides and replay). 

The following information is borrowed from the slides. Watch both presentations to get a better feel for the technology. 

This new code runs best on HMC hardware CR6 or later. As for memory, you can try to get by with 4 GB, but 8 GB is recommended (the CR7 and CR8 start at 8 GB).

You can only get this to work with POWER6, POWER7 or POWER8 servers (no POWER5). 

What is a Technical Preview? It is there for:

• Evaluation purposes
• Technical familiarity
• Learning and feedback

Can you use it in production? Is it “supported”? The answer is yes and no. You’re not allowed to raise a PMR:

• The PMR response would be “use the Classic version”
• If you can reproduce the issue there, raise a PMR
• But you can get support via the forum: http://tinyurl.com/HMC8-Tech-Preview-Forum
• There are also developerWorks pages for feedback on the Enhanced+ Tech Preview user interface; the developers seem to check for questions daily.

The charts include some nice slides that show the options available in the various versions: classic, enhanced, or tech preview. I’ve used each version, though I keep going back to the classic version as that’s what I’m most familiar with. Still, knowing where things are headed, I’m making the effort to try the new code.

There’s a slide that shows the new HMC learning curve:

1. Oh heck! What the dickens is this about? I can’t do this now!
2. Oh nuts! I can’t find anything!
3. Oh! Darn . . . it’s got to be here somewhere!
4. Oooo! That was cool!
5. I wonder what that button does? Wow!
6. Hey, I seem to be getting the hang of this now!
7. Yep! This is workable.
8. I have 5 minutes. Let’s try something I have never done before.
9. When they get this working a bit faster – I will use it.

For what it’s worth, I think I’m somewhere between 3 and 5. How about you? Do you have the new code loaded? Are you using it? Where are you on this scale?

A Docker Primer

Edit: Some links no longer work.

Originally posted June 16, 2015 on AIXchange

Lately I’ve been reading about Docker, and it seems to keep coming up everywhere I look. If you haven’t heard of it, it’s an “open platform for developers and sysadmins to build, ship, and run distributed applications.” Here’s more from Docker’s website:

Why do developers like it?    

“With Docker, developers can build any app in any language using any toolchain.   “Dockerized” apps are completely portable and can run anywhere — colleagues’ OS X and Windows laptops, QA servers running Ubuntu in the cloud, and production data center VMs running Red Hat.” 

Of course I must take this with a grain of salt, since Docker doesn’t support x86 applications running on POWER8 systems or vice versa. Still, it’s an interesting concept, and Docker does fully support running POWER8 applications on other POWER8 systems — all you need is an OS that has Docker installed. I’ll get to that in a bit.

Why do sysadmins like it?    

“Sysadmins use Docker to provide standardized environments for their development, QA, and production teams, reducing “works on my machine” finger-pointing. By “Dockerizing” the app platform and its dependencies, sysadmins abstract away differences in OS distributions and underlying infrastructure.”     

How is this different from Virtual Machines?    

“Each virtualized application includes not only the application – which may be only 10s of MB – and the necessary binaries and libraries, but also an entire guest operating system – which may weigh 10s of GB.”     

Docker    

“The Docker Engine container comprises just the application and its dependencies. It runs as an isolated process in userspace on the host operating system, sharing the kernel with other containers. Thus, it enjoys the resource isolation and allocation benefits of VMs but is much more portable and efficient.” 

To learn more about Docker, I turned to Twitter and found this link to five videos. While some are rather long, they’re all informative:     

“When you’re interested in learning a new technology, sometimes the best way is to watch it in action—or at the very least, to have someone explain it one-on-one. Unfortunately, we don’t all have a personal technology coach for every new thing out there, so we turn to the next best thing: a great video.     

First up, if you’ve only got five minutes (well, technically seven and a half), watch this. At Opensource.com’s lightning talk series last fall, Docker contributor Vincent Batts of Red Hat gave a great overview of what Docker is, what containers are, and how these technologies are changing the way system administrators and developers are working together to deploy applications in a modern datacenter.     

Now, you understand the concept, so let’s take a slightly deeper dive. Docker founder and CTO Solomon Hykes takes you beyond the basics of containers and into how Docker works, what problems it solves, and some real-world demos.” 

I’ve found that it’s simple to get Docker working on Power systems using Ubuntu 15.04. Once I had Ubuntu running on my system (a trivial process consisting of downloading the .iso to my virtual media repository, creating a Linux LPAR on a POWER8 machine and installing the .iso image), I ran:             

apt-get install docker.io 

For more on the installation process, see this piece from IBM developerWorks.

While I’m at it, developerWorks has two other related resources — this how-to on using Ubuntu Core with Docker, and this more general Docker write-up:     

“Docker is an open-source container engine and a set of tools to compose, build, ship, and run distributed applications. The Docker container engine provides for the execution of applications composed as Docker images. Docker hub and Docker private registries are services that enable sharing of Docker images. The Docker command-line tool enables developers to build Docker images, to work with Docker hub and private registries, and to instantiate Docker images as containers in a Docker container engine. The relatively small size of Docker images—compared with VM images—and their portability enable developers to move applications seamlessly between test, quality assurance, and production environments, increasing the agility of application development and operations.” 

For me it was pretty interesting to fire up a Docker image of Fedora and run it on my Ubuntu machine. As far as the workload knows, it’s running on Fedora, but under the covers Ubuntu is running on a POWER8 machine. Of course nothing beats hands-on experience, but if you’re familiar with the concept of WPARs, Docker shouldn’t be hard to grasp. 
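Here’s a minimal sketch of that experiment; the image name is illustrative, and on POWER you’d need a ppc64/ppc64le build of the image rather than the x86 default:

    docker pull fedora            # fetch a Fedora image
    docker run -it fedora bash    # start an interactive container
    # inside the container, this reports Fedora even though the
    # host is an Ubuntu LPAR on POWER8:
    cat /etc/os-release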

Even if you don’t intend to run it in production any time soon, I believe that Docker is worth exploring. Again, downloaded Power images and POWER8 “Dockerized” applications run on Docker just fine. It’s another interesting environment in which to work. So is Docker in your plans?

Fixing RMC Connections to the HMC with 8.8.2 SP1

Edit: Information may still be useful, although I doubt anyone is running this version of HMC code anymore.

Originally posted June 9, 2015 on AIXchange

Recently after upgrading to 8.8.2 SP1, I found my HMC was unable to communicate via RMC to my client LPARs. Though this document helped, when I ran the lspartition -dlpar command, I got this error message:             

Can’t start local session rc=2! 

The document notes that the fix commands must be run as root on the management console, and gives these specific commands:


            /usr/sbin/rsct/install/bin/recfgct
            /usr/sbin/rsct/bin/rmcctrl -p

Of course the problem is you can’t run these commands if you can’t become root. So I contacted IBM Support to get my pesh passwords, and received this email: 

Thank you for contacting IBM.

I understand that HMC is displaying error “Can’t start local session rc=2!” when you run “lspartition -dlpar”.

To resolve this issue, please login to your HMC as hscpe, get root access, and run:

            /usr/sbin/rsct/install/bin/recfgct

Once you have done this, the “lspartition -dlpar” command should show current RMC connection status.

If you do not have the hscpe user on your HMC, you can create it with:

            mkhmcusr -u hscpe -a hmcpe -d "ibm"

If the user already exists, but you do not have the password, you can reset it with:

            chhmcusr -u hscpe -t passwd

You can also reset the root password with:

             chhmcusr -u root -t passwd

Once you login to the HMC as hscpe, run:

            pesh <hmc_serial_number>

You will be prompted for the pesh password, which we have to generate using the HMC serial number; the serial number is listed in the SE field of “lshmc -v” output.

Enter the pesh password.

This will bring you to a prompt where you can run:

            su - root
            Password: <enter_root_password>

You can then run:

            /usr/sbin/rsct/install/bin/recfgct

Wait a few minutes, then run:

            lspartition -dlpar

You should not get the local session error. If you have issues with the RMC connection not being established for the LPARs, please let us know so that we can continue assisting with standard DLPAR troubleshooting procedures. 

In my case this was all I needed to do. Everything started working normally for me. 

Incidentally, since writing this, I came across someone else with the same issue. Hopefully as more folks get this information out here, more of us will have an easier time dealing with this problem. 

Getting Volume Group Info

Edit: Link no longer works.

Originally posted June 2, 2015 on AIXchange

In environments with machines containing many volume groups and filesystems, we want easy ways of manipulating that information. There’s always a need to know which filesystem is in which volume group. If you want to grow the size of a filesystem, you’ll want to know which volume group it’s in so you can check whether that volume group has free space available.

Brian Smith has a useful post about getting this type of information.

Quick tip: List details of all volume groups with lsvg on AIX

The “lsvg” command has a handy “-i” option, which the man page says, “Reads volume group names from standard input.” This brief description doesn’t explain how useful this option can be.

If you run “lsvg” and pipe the output to “lsvg -i” (i.e., “lsvg | lsvg -i”), it will list the volume group information for every volume group on the system. You can also use other lsvg options, such as “-l”, to list all of the LVs/filesystems from every volume group: “lsvg | lsvg -li.”

This is an excellent way to gather LVM [Logical Volume Manager] information from your system quickly and easily. Another way to use lsvg is to incorporate xargs. As noted in its Wikipedia entry, “in many other UNIX-like systems, arbitrarily long lists of parameters cannot be passed to a command, so xargs breaks the list of arguments into sublists small enough to be acceptable.”  

So, for example, this:

    lsvg -o | xargs -n1 lsvg -l

is similar to this:

    lsvg | lsvg -li

Likewise, this:

    lsvg -o | xargs -n1 lsvg

is similar to this:

    lsvg | lsvg -i
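Building on those examples, here’s a small sketch that answers the question this post opened with: finding which volume group a given filesystem (/home, as an example) belongs to:

    # Check each online volume group's LV listing for the filesystem
    for vg in $(lsvg -o); do
        lsvg -l $vg | grep -qw "/home" && echo "/home is in $vg"
    done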

What methods do you use to examine volume group information?

Updating System Firmware

Edit: Some links no longer work.

Originally posted May 26, 2015 on AIXchange

If you’re new to IBM Power Systems, you’re new to upgrading the HMC (see here and here). Furthermore, you’re new to system firmware updates. I’ve previously discussed firmware, and IBM Systems Magazine has other good articles about it (here and here). There’s also this step-by-step guide to updating your system firmware

“IBM Power Systems firmware update, which is often referred to as Change Licensed Internal Code (LIC) procedure, is usually performed on the managed systems from the Hardware Management Console (HMC). Firmware update includes the latest fixes and new features. We can use the Change Licensed Internal Code wizard from the HMC graphical user interface (GUI) to apply updates to the Licensed Internal Code (LIC) on the selected managed system.

We can select multiple managed systems to be updated simultaneously. The wizard also allows us to view the current system information or perform advanced operations. This tutorial provides the step-by-step procedure for the IBM Power Systems firmware update from the HMC command line, and the HMC GUI and is targeted for system administrators.

These step-by-step instructions can prepare the newbie for what needs to be done, and how to do it, to stay on the latest firmware level all the time. When you purchase new hardware, the best [practice] is to upgrade all the firmware to the latest level. 

This tutorial provides the following information: 

-Current firmware details 

-Different kinds of code download and update methods 

-Steps to obtain the relevant firmware code updates or releases from the IBM FixCentral website 

-Steps to update the firmware concurrently using DVD media, that is, the fixes that can be deployed on a running system without rebooting partitions or performing an initial program load (IPL) within a specific release 

-Steps to update the firmware disruptively, that is, update requiring the system IPL within a specific release 

-Advanced code update options from the Change Licensed Internal Code wizard 

-Steps to upgrade to recent firmware releases disruptively using the File Transfer Protocol (FTP) method 

-Steps to upgrade the firmware disruptively through the IBM Service website to a required level.” 

Keep in mind that being able to actually get your hands on system firmware requires you to have entitlement for your machines. This means you must make sure you can actually get the code you need, preferably before you actually need it. There’s nothing worse than getting all the necessary approvals for a change window and system downtime, only to have to fail the change and reschedule it because you didn’t have the code you needed. 

How do your machines look? Are your HMC, system firmware and device firmware all at their recommended levels? If you aren’t sure what levels they should be running, don’t forget to check the fix level recommendation tool (FLRT).

When Rebooting, Don’t Forget About the System Profile

Edit: Still good stuff.

Originally posted May 19, 2015 on AIXchange

Recently a customer rebooted some systems that hadn’t been restarted in more than a year. All of the LPARs and the VIO servers were powered off so maintenance could be performed. The customer was able to use live partition mobility to relocate the important LPARs. That left just the dev and test environments. 

Of course, plenty of systems have gone much longer without a reboot, but restarting systems after a year-plus of continuous uptime can be tricky. And in this instance, problems emerged. Someone had done DLPAR operations without then updating the system profile. To make matters worse, the DLPAR operations were related to the VIO server and virtual fibre adapters. When the VIO servers came back up, the system didn’t recognize the dynamically added adapters, and the client LPARs wouldn’t boot.  

Luckily, the customer had hmcscanner output so they could see which adapters were missing based on the information in the client LPAR profiles. However, what should have been a quick restart ended up being a lengthy exercise because the profile information wasn’t in sync with what was actually running.  

How is your systems documentation? When you make a change, do you make sure that the profile has also been updated or saved? 

Along with mksysb, backupios and viosbr, be sure to back up your profile data on the HMC. You never know when someone might have made a change to the running systems and then neglected to back up the profile.
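As a refresher, here’s roughly what those backups look like; file names and the managed system name are illustrative, and you should check your HMC release for the exact bkprofdata syntax:

    mksysb -i /mksysb/lpar1.mksysb              # AIX rootvg backup, on each LPAR
    backupios -file /home/padmin/vios_backup    # VIO server backup, run as padmin
    viosbr -backup -file vios_config            # VIOS virtual device configuration
    bkprofdata -m Server-8284-22A -f myprofiles # HMC partition profile data backup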

E850 Among the New POWER8 Servers Announced by IBM

Edit: As of the time of the writing the links still worked.

Originally posted May 11, 2015 on AIXchange

On April 28, IBM announced new capabilities for existing POWER8 servers. Today, it’s announcing a new POWER8 server model.

There is a new four-socket 4U server, the Power E850 server, machine type/model 8408-E8E, which will become generally available on June 5, 2015.

The Power E850 will support a maximum of 2 TB of memory, which is a 2X increase over the Power 750, with a statement of direction taking it to 4 TB of memory in the future. The E850 is also redesigned as a 4U server versus the 5U Power 750 and 760 that we had with POWER7+.

The E850 can have two to four processor sockets, up to 3.7 GHz. If you order a system with two processors, you will be able to add the third or fourth processor later with an MES upgrade, or you can populate your server with extra CPU and memory in advance to take advantage of processor and/or memory Capacity Upgrade on Demand, now available with this model.

The processor options for the E850 include up to 48 cores running at 3.02 GHz, up to 40 cores running at 3.35 GHz, or up to 32 cores running at 3.72 GHz. This server will be part of the small software tier.

The E850 will have 11 PCIe Gen3 slots, one of which will be populated by a LAN adapter of your choice; however, keep in mind that some of the slots and memory options may not be available if you do not populate all of the processor sockets. There are two x16 slots available per installed processor, and three additional x8 slots on each system. If you populate the processors and choose to activate them later, you will have access to all of the available slots and memory.

The E850 is considered an enterprise server, with enhanced reliability features like Active Memory Mirroring for Hypervisor and Capacity Upgrade on Demand, although it is a customer set-up machine and cannot be part of a Power Enterprise Pool the way the E870 and E880 servers can today.

The following is a list of E850 supported OS levels that I took from a chart from IBM.

If installing AIX LPAR with any I/O configuration:

  • AIX V7.1 TL3 SP5 and APAR IV68444, or later
  • AIX V7.1 TL2 SP7, or later (planned availability September 30, 2015)
  • AIX V6.1 TL9 SP5 and APAR IV68443, or later
  • AIX V6.1 TL8 SP7, or later (planned availability September 30, 2015)

If installing AIX Virtual-I/O-only LPAR:

  • AIX V7.1 TL2 SP1, or later
  • AIX V7.1 TL3 SP1, or later
  • AIX V6.1 TL8 SP1, or later
  • AIX V6.1 TL9 SP1, or later

If installing VIOS:

  • VIOS 2.2.3.51 or later

If installing the Linux operating system:      

-Big Endian

  • Red Hat Enterprise Linux 7.1, or later
  • Red Hat Enterprise Linux 6.6, or later

  • SUSE Linux Enterprise Server 11 Service Pack 4 and later Service Packs

-Little Endian

  • Red Hat Enterprise Linux 7.1, or later
  • SUSE Linux Enterprise Server 12 and later Service Packs
  • Ubuntu 15.04

IBM is also announcing that the Power E880 server fulfills an earlier statement of direction: it can now max out at four nodes instead of two, with a new 4 GHz clock speed offering up to 48 cores per node. Also fulfilling a previous statement of direction, the E870 now supports the larger 128 GB memory DIMMs, doubling its maximum memory. In addition, both the E870 and E880 can now attach from one up to four PCIe expansion drawers per node, meaning a four-node E880 with four expansion drawers per node could hold up to 192 adapters.

The E880 supports up to 192 cores at 4 GHz, or up to 128 cores at 4.4 GHz, and up to 16 TB of memory with 4 TB per node with 128 GB DIMMs. The E880 has eight PCIe adapter slots per node, along with the capability to have up to 16 PCIe I/O expansion drawers within a four-node system.

The E870 supports up to 8 TB memory, with 4 TB per node with 128 GB DIMMs. It runs up to 80 cores at 4.2 GHz or up to 64 cores at 4.0 GHz. You can order one or two nodes, which are still 5U per node, and you can have up to eight PCIe I/O expansion drawers with a two-node system, which will allow for up to 96 adapters.

With the last announcement, we were limited to either zero or two I/O drawers per E870/E880 node. With this new announcement, we can use just one I/O drawer, and we can even configure one-half of an I/O drawer if necessary. The E870 supports from a half I/O drawer up to four I/O drawers per node; in a two-node E870, the range is a half drawer to eight I/O drawers. The E880 supports from a half I/O drawer up to four I/O drawers per node; in a four-node E880, the range is a half drawer to 16 I/O drawers.

Both the E870 and E880 are part of the medium software tier.

The I/O drawer can attach to all POWER8 servers:

  • E880 up to 16 I/O drawers
  • E870 up to eight I/O drawers
  • E850 with four sockets populated up to four I/O drawers; with three sockets populated up to three I/O drawers; with two sockets populated up to two I/O drawers
  • S824 with two sockets populated up to two I/O drawers; with one socket populated one I/O drawer
  • S822 with two sockets populated up to one I/O drawer; with one socket populated one-half I/O drawer

The I/O drawers are supported in numerous environments, with the exception of the OPAL hypervisor and PowerKVM.

There are also enhancements to the POWER8 scale-out servers.

There is a new processor option for the S822 and S822L: you can now order a socket with an 8-core 4.15 GHz option, for a total of up to 16 cores running at 4.15 GHz. Keep in mind that the maximum memory with this option is 512 GB, using the 16 GB or 32 GB DIMMs only. Due to the high cooling requirements, there are also limitations around which I/O cards can be installed in the CEC versus an external expansion drawer, so keep that in mind.

The S814 will allow for up to 1 TB of memory, and the S824 and S824L will allow for up to 2 TB, with the 128 GB DIMM option. This DIMM is too tall to fit in the 2U machines, so don’t expect to see it in those servers.

IBM will now support the S824L without a GPU. In the last announcement, the S824L ran only bare-metal Ubuntu when the GPUs were installed. Now you are able to get PowerVM without the GPU.

Every scale-out server will be able to use I/O expansion drawers. The maximum number of slots on a 2U one-socket server, or a 2U two-socket server with one socket filled, becomes 10 slots. A 2U two-socket server with both sockets filled gives us 18 slots. A 4U one-socket server, or a 4U two-socket server with one socket filled, will have 16 slots. A 4U two-socket server with two sockets filled will have a maximum of 30 slots.

With the S814 rack model, you can now get a 900W 110V power supply without an RPQ. The S814 tower model has always supported 900W power supplies.

There is a statement of direction that IBM will have water cooling available for the S822 and S822L.

As more Linux distributions now support little endian mode on Power, you will need to upgrade VIOS to 2.2.3.50, which adds support for little endian Linux LPARs running side by side with big endian Linux LPARs, AIX LPARs and IBM i LPARs on the non-L models.

This newer version of VIOS adds another digit to the numbering scheme, signifying minipacks. This is a cleaner approach to applying PTFs than using ifixes, which you have to install and uninstall.
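For example, on a VIO server at this level, ioslevel shows the five-digit string; as I understand the new scheme, the final digit is the minipack:

$ ioslevel
2.2.3.51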

You can read more about the changes to the strategy at ibm.biz/vios_service.

This makes for quite a portfolio of POWER8 servers and options now available from IBM. Contact your favorite IBMer or business partner; I am sure you can find the right machine for your environment.

Handy Tool Provides Adapter Info

Edit: Some links no longer work.

Originally posted May 5, 2015 on AIXchange

As I’ve mentioned, I follow several AIX and IBM Power Systems pros on Twitter.

Benoit Creau (@chmod666 on Twitter) is someone you should follow as well. He’s been working on a new tool called lssea that “lists information and details about PowerVM shared Ethernet adapters.” (Go here for the code, and here, here and here for some nice screenshots that will give you an idea of what to expect when you run the code on your system.) 

I found it very easy to set up. After running oem_setup_env on my VIO server, I clicked the button marked “raw” on the GitHub page, selected everything, and cut and pasted it into a file I created with vi lssea.

Then I ran chmod u+x lssea, followed by ./lssea. It immediately showed me output listing the server on which it ran. I was also presented with my ioslevel, the version of the lssea code I’m running, and the date.

running lssea on vio1 | IBM,XXXXXXXXX | ioslevel 2.2.3.3 | 0.1c 030915
SEA : ent9
number of adapters   : 2
vlans                : 1 2 3 4 5
flags                : THREAD LARGESEND

Again, I encourage you to check out the screenshots. It’s a quick way to determine which real adapters belong to which SEAs as well as find information about control channels, link status, speed, etc. By running it with the -b option, you’ll also get buffer information. As an added bonus, if you want to know how Benoit is getting the information that he’s displaying with lssea, it’s all there because you have access to the source code.

I love tools like this that take output we’re all familiar with and provide useful new functionality.

Why You Should Keep a Local Alt Disk Copy

Edit: Some links no longer work.

Originally posted April 28, 2015 on AIXchange

After upgrading an AIX system, a customer found that they needed to back out of the change. They ended up restoring rootvg from a mksysb. 

Although that’s one way to do it, I don’t recommend it. Of course you should have an mksysb around in case of a disaster, but you should also have a local alt disk copy available. This is true for any type of upgrade, but it’s especially critical for both VIO server and regular AIX upgrades. 

In addition, a disk copy can come in handy if someone accidentally messes up rootvg during regular operations. You can switch your bootlist and reboot to a clean copy of your rootvg rather than try to restore from a backup. 
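As a minimal sketch (hdisk0 and hdisk1 are hypothetical; the target must be a free disk large enough to hold rootvg):

# alt_disk_copy -d hdisk1
# bootlist -m normal hdisk0
# bootlist -m normal -o

The first command clones the running rootvg to hdisk1; by default it also points the bootlist at the clone, so the second command points it back at the original disk if you want to keep the clone purely as a fallback, and the third verifies the result. If you later need to back out of a change, set the bootlist to hdisk1 and reboot. On a VIO server, the padmin command alt_root_vg works along the same lines.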

Here are several articles that explain this in detail.

IBM developerWorks: 

“With IBM Power virtualization, the VIOS plays an important role and all running VIOS client LPARs are fully dependent on the Virtual I/O Servers. In such an environment, updating VIOS to a next fix pack level can be challenging, without taking the system down for an extended period of time and incurring an outage. This can be mitigated by creating a copy of the current root volume group (rootvg) on an alternate disk and simultaneously applying fix pack updates first on the cloned rootvg on a new disk.” 

For example, to update VIOS 1.3.0.0 to 1.3.0.0-FP8, you clone the 1.3.0.0 system, then install updates to bring the cloned rootvg to 1.3.0.0-FP8. This updates the system while it is still running. Rebooting from the new rootvg disk brings the level of the running system to 1.3.0.0-FP8. If a problem with the new VIOS level is discovered, changing the bootlist back to the 1.3.0.0 disk and rebooting the server brings the system back to 1.3.0.0. Another scenario would be cloning the rootvg and applying individual fixes, rebooting the system and testing those fixes, and rebooting back to the original rootvg if there is a problem.

This article explains the step-by-step procedure for applying the next fix pack level on VIOS by creating a copy of the current rootvg on an alternate disk and simultaneously applying fix pack updates. 

IBM Knowledge Center:

“The alt_disk_copy command allows users to copy the current rootvg to an alternate disk and to update the operating system to the next maintenance or technology level, without taking the machine down for an extended period of time and mitigating outage risk. This can be done by creating a copy of the current rootvg on an alternate disk and simultaneously applying software updates. If needed, the bootlist command can be run after the new disk has been booted, and the bootlist can be changed to boot back to the older maintenance or technology level of the operating system. 

Cloning the running rootvg allows the user to create a backup copy of the root volume group. This copy can be used as a backup in case the rootvg fails, or it can be modified by installing additional updates. One scenario might be to clone a 5300-00 system, and then install updates to bring the cloned rootvg to 5300-01. This would update the system while it was still running. Rebooting from the new rootvg would bring the level of the running system to 5300-01. If there was a problem with this level, changing the bootlist back to the 5300-00 disk and rebooting would bring the system back to 5300-00. Other scenarios would include cloning the rootvg and applying individual fixes, rebooting the system and testing those fixes, and rebooting back to the original rootvg if there was a problem. 

At the end of the install, a volume group, altinst_rootvg, is left on the target disks in the varied off state as a place holder. If varied on, it indicates that it owns no logical volumes; however, the volume group does contain logical volumes, but they have been removed from the ODM because their names now conflict with the names of the logical volumes on the running system. Do not vary on the altinst_rootvg volume group; instead, leave the definition there as a placeholder. 

After rebooting from the new alternate disk, the former rootvg volume group shows up in a lspv listing as old_rootvg, and it includes all disks in the original rootvg. This former rootvg volume group is set to not vary-on at reboot, and it should only be removed with the alt_rootvg_op -X old_rootvg or alt_disk_install -X old_rootvg commands. 

If a return to the original rootvg is necessary, the bootlist command is used to change the bootlist to reboot from the original rootvg.” 

IBM developerWorks (again): 

“In 2009, I wrote about using alt_disk_copy… to clone your rootvg disks for ease of back-out when doing AIX upgrades or applications upgrades that resided on the rootvg disks. In that article, I did not cover hardware migrations as this was out of scope. In this article, I discuss how this can be achieved. The man page on alt_disk_copy states (by using the ‘O’ option), “Performs a device reset on the target altinst_rootvg. This causes the alternate disk install to not retain any user-defined device configurations. This flag is useful if the target disk or disks become the rootvg of a different system.” 

In a nutshell, this means that any devices that have had their attributes changed, typically by the system administrator, are reset to the default value(s).” 

AIX Health Check:

“It is very easy to clone your rootvg to another disk, for example for testing purposes. For example: If you wish to install a piece of software, without modifying the current rootvg, you can clone a rootvg disk to a new disk; start your system from that disk and do the installation there. If it succeeds, you can keep using this new rootvg disk; if it doesn’t, you can revert back to the old rootvg disk, like nothing ever happened.” 

And finally, here’s IBM’s “Introduction to Alt_Cloning on AIX 6.1 and 7.1”:

“This guide is intended for those who are new to alternate disk cloning, (or alt_clone for short) and would like to understand the alt_clone process.”

If you would like to learn more about alternate disk cloning, visit the IBM publib website and search on “alt_disk.” 

Do you keep spare LUNs around for your alt_disk copies? If not, why not?

Simplifying PowerVM Management

Edit: Some links no longer work.

Originally posted April 21, 2015 on AIXchange

In December I wrote about a document that covers HMC simplification. Actually, the doc isn’t just about that. It’s also about how IBM is trying to make managing PowerVM easier for customers.

From the document: 

“Managing the IBM PowerVM infrastructure involves configuring its different components, such as the POWER Hypervisor and the Virtual I/O Server(s). Historically, this has required the use of multiple management tools and interfaces, such as the Hardware Management Console (HMC) and the [VIO server] command line interface. 

The PowerVM simplification enhancements were designed to significantly simplify the management of the PowerVM infrastructure, improve the Power Systems management user experience, and reduce the learning ramp for users unfamiliar with the PowerVM technologies. 

This paper provides an overview of the PowerVM simplification enhancements and illustrates how to use the new features available in the HMC to set up and manage the PowerVM infrastructure.” 

Again though, there’s much more to this. How we manage our Power servers will soon undergo some changes. Here’s more from the document: 

“IBM PowerVM is the virtualization solution that enables workload consolidation for AIX, IBM i, and Linux environments on IBM Power Systems. 

The [VIO] Server is a software appliance that works in conjunction with the POWER Hypervisor to enable sharing of physical I/O resources among partitions. Two or more [VIO] Servers are often deployed to provide maximum RAS when provisioning virtual resources to partitions.” 

Next comes an explanation of how IBM is attempting to simplify things. For those of us who’ve worked on the platform for years, it’s pretty straightforward. But if you work with new POWER server users, it’s another matter. A bit of a learning curve will be involved as far as getting it all working and understanding what’s going on under the covers: 

“The PowerVM simplification enhancements encompass architecture changes to the POWER hypervisor and [VIO] Server, new virtualization management features, and new Hardware Management Console (HMC) graphical and programmatic user interfaces to manage the PowerVM infrastructure. The enhancements can be grouped in three main areas:

* Simplified PowerVM infrastructure deployment using templates.           

* Simplified PowerVM management and virtual machine provisioning.           

* Integrated performance and capacity monitoring tool. 

These enhancements are available when managing POWER6, POWER7, and POWER8 Systems using HMC V8.1 or later; except for the performance tool which is available with HMC V8.0 or later. VIO [Server] V2.2.3 or later is recommended for best performance. 

You can access all enhancements by logging in to the HMC Graphical User Interface (GUI) using the Enhanced or Enhanced+ log in option. The performance tool is also available with the Classic log in option. A comparison of the features available with each log in option can be found in the POWER 8 knowledge center.” 

The document offers quite a bit of detail. With the new versions of HMC code that are coming out, we’ll be able to do much more from the GUI. There won’t be as great a need to configure machines from the VIO command line. Future posts will cover my impressions of the new HMC code, but for now, here’s more from the document: 

“Configuring and managing the PowerVM infrastructure on Power Systems can be accomplished performing the following tasks: 

1. Capturing and editing templates to create custom PowerVM configurations that can be deployed on one or more systems.

2. Deploying a system template to initialize the PowerVM infrastructure.

3. Creating a partition from template to get ready to deploy workloads.

4. Managing PowerVM to modify the virtual network and virtual storage configuration as needed to meet workload demands.

5. Managing partitions to dynamically modify their virtual storage and network resources as needed.

6. Monitoring performance and capacity information to understand resource utilization and identify potential problems.” 

Although I still prefer the command line, I can understand the desire to simplify PowerVM management. I know that for non-UNIX users and those with an IBM i background, things like command completion and shell history can be hard to understand. Rather than have to learn all of this, these folks now have the option to simply manage their machines via point and click: 

“You can view and modify all the [VIO] Server resources and configuration settings by selecting a [VIO Server] in the [VIO Server] overview and accessing the Manage task. The Manage task allows the user to change the processor, memory, physical I/O, and hardware virtualized I/O resources, e.g. logical Host Channel Ethernet Adapters or logical SR-IOV ports, configured to the [VIO Server], either dynamically, that is, while the [VIO Server] is powered on, or when the [VIO Server] is shutdown. 

You can view and modify all partition resources by selecting a partition and accessing the Manage Partition task. You can dynamically change virtual network, virtual storage, and hardware virtualized I/O resources configured to the partition. 

You can access the performance dashboard for a system by selecting a system and choosing Performance. The performance dashboard provides quick visualization of system and partition processor, memory, and I/O resources allocation and utilization… .” 

The PowerVM simplification enhancements available through the [HMC] significantly simplify virtualization management tasks on IBM Power Systems and support a repeatable workload deployment process. 

As with anything in technology, I like to consider how far things have come. It’s pretty incredible to look back on what we could do with early versions of VIO server and HMC code and compare it to the things we can do today. At the same time, as much as I relish looking back, I also look forward to what’s ahead. Where PowerVM is concerned, I’m excited about the future.

Setting Up LPAR Error Notification

Edit: Are you monitoring errors?

Originally posted April 14, 2015 on AIXchange

Your shop has no budget for monitoring software, but you still want to be notified when LPAR errors appear in the AIX error log. You have a few options. 

You could write scripts and periodically run them out of cron. You could set up a master workstation and use it to ssh into each of the machines you want to monitor and run errpt. Or you could set up your machines to send you email notifications of new errors. To do this, you could hard code an email address — either your own, a group address or some generic address (e.g., one that’s monitored by operations or the on-call person) — or you could route the emails to root on the server and set up a .forward file to distribute them to all the addresses you choose to designate. This nice how-to document has the details: 

“Having the pleasure of working across many client accounts, it’s funny to see some of the convoluted scripts people have written to receive alerts from the AIX error log daemon. Early in my AIX career, I used to do the exact same thing, and it involved a whole bunch of SSH keys, some text manipulation, crontab, and sendmail. Wouldn’t it be nicer if AIX had some way of doing all of this for us? Well, you know I wouldn’t ask the question if the answer wasn’t yes! 

Step 1
Create a temporary text file (e.g. /tmp/errnotify) with the following text:

errnotify:
        en_name = "mail_all_errlog"
        en_persistenceflg = 1
        en_method = "/usr/bin/errpt -a -l $1 | mail -s \"errpt $9 on `hostname`\" user@mail.com"

Step 2
Add the new entry into the ODM:

# odmadd /tmp/errnotify

Step 3
Test that it’s working by adding an entry into the error log.

# errlogger 'This is a test entry'

If required, you can delete the ODM entry with the following command:

# odmdelete -q 'en_name=mail_all_errlog' -o errnotify

0518-307 odmdelete: 1 objects deleted. 

To send notifications to multiple addresses, you can use something like ops@company.com,unix@company.com. To update your email address, be sure to do the odmdelete first; if you just rerun the odmadd, it will create multiple entries in the ODM. To see the entries on your system, use:

# odmget -q 'en_name=mail_all_errlog' errnotify

One caveat: I know of one environment that processed so much email and logged so many SAN errors that it actually impacted system performance. It would be nice if there were a way to limit the rate at which error messages are sent out when a flood of errors is generated for some reason. This whole process assumes you have sendmail working. For those instructions, check out this IBM developerWorks article: 

To start the Sendmail daemon automatically on a reboot, uncomment the following line in the /etc/rc.tcpip file:

# vi /etc/rc.tcpip

start /usr/lib/sendmail "$src_running" "-bd -q${qpi}"

Execute the following command to display the status of the Sendmail daemon:

# lssrc -s sendmail

To stop Sendmail, use stopsrc:

# stopsrc -s sendmail

The Sendmail configuration file is located in the /etc/mail/sendmail.cf file, and the Sendmail mail alias file is located in /etc/mail/aliases.

If you add an alias to the /etc/mail/aliases file, remember to rebuild the aliases database and run the sendmail command with the -bi flag or the /usr/sbin/newaliases command. This forces the Sendmail daemon to re-read the aliases file.

# sendmail -bi

To add a mail relay server (smart host) to the Sendmail configuration file, edit the /etc/mail/sendmail.cf file, modify the DS line, and refresh the daemon:

# vi /etc/mail/sendmail.cf
DSsmtpgateway.xyz.com.au

# refresh -s sendmail 
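Back to the rate-limiting caveat above: I’m not aware of anything built in, but you could point en_method at a small wrapper script instead of mailing directly. Here’s a rough sketch, with a made-up script name, interval and address; test it before trusting it:

#!/usr/bin/ksh
# /usr/local/bin/errnotify_mail.sh -- hypothetical throttling wrapper.
# Use it in the ODM entry:  en_method = "/usr/local/bin/errnotify_mail.sh $1"
# Sends at most one mail every 300 seconds; errors logged inside the window
# are not mailed, but remain visible in errpt.
SEQ=$1
STAMP=/tmp/.errnotify_last
NOW=$(perl -e 'print time')
LAST=$(cat $STAMP 2>/dev/null)
[ -z "$LAST" ] && LAST=0
if [ $(( NOW - LAST )) -ge 300 ]; then
    echo $NOW > $STAMP
    /usr/bin/errpt -a -l $SEQ | mail -s "errpt on $(hostname)" user@mail.com
fi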

You can use this same method to monitor your VIO servers. 

How are you notified of LPAR errors?

The more Command and vi

Edit: Did you know you could do this?

Originally posted April 7, 2015 on AIXchange

It’s easy to overlook the simple things. For instance, did you know that vi can be invoked from within the more command? 

From “man more”:            

The more command uses the following subcommands:

             h            Displays a help screen that describes the more subcommands.

             v            Starts the vi editor, editing the current file in the current line.

To try this out, run:

more /etc/hosts

Then, from inside your more session, type v and you will go into vi. At that point you’re actually inside vi in full editing mode, not just a pager.

Once you modify the file and save your changes, exit out to return to your more session and verify that your changes were made.

Be sure to look at some of the other options available in more, such as how to get to the very end or very beginning of a file, or how to skip ahead a particular number of lines. 
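For instance, here are a few more movement subcommands worth knowing (these are the typical ones; check man more on your AIX level for the exact list):

             G            Goes to the last line of the file (nG goes to line n).
             1G           Returns to the first line.
             ns           Skips forward n lines.
             /pattern     Searches forward for a pattern.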

While we’re on the subject, here’s a reminder for newer AIX administrators: “set -o vi” gives you easy access to your shell history, along with command completion and other capabilities. 

I could go on about the usefulness of vi, but why not get a vi cheat sheet and see for yourself? I’ll list a couple (here and here), but there are many more online. Just search on “vi cheat sheet” and look at the image results.

More Terrifying Tales of IT

Edit: We see these stories these days when ransomware takes out critical systems.

Originally posted March 31, 2015 on AIXchange

I enjoy reading IT-related horror stories, especially those that hit close to home. For me, the best thing about these stories is figuring out what went wrong and then incorporating those lessons into my own environments. Here are a couple of good reads that I want to share.

First, from Network World:

        “Our response to the outage was professional, but ad-hoc, and the minutes trying to resolve the problem slipped into hours. We didn’t have a plan for responding to this type of incident, and, as luck would have it, our one and only network guru was away on leave. In the end, we needed vendor experts to identify the cause and recover the situation.
        Risk 1: The greater the complexity of failover, the greater the risk of failure.
        Remedy 1: Make the network no more complex than it needs to be.
        Risk 2: The greater the reliability, the greater the risk of not having operational procedures in place to respond to a crisis.
        Remedy 2: Plan, document and test.
        Risk 3: The greater the reliability, the greater the risk of not having people that can fix a problem.
        Remedy 3: Get the right people in-house or outsource it.”

I’ve always said that having a test system is invaluable, but simply having the system available to you isn’t enough. You must also make the time to use it, play with it, blow it up. And you absolutely cannot allow your test box to slowly morph into a production server.

This ComputerWorld article tells an even scarier tale of a hospital that was forced to go back to all paper when its network crashed. Though this incident occurred back in 2002, I believe it’s still relevant reading. Technology today is more reliable than ever, but troubleshooting is a skill we’ll always need.

        “Over four days, Halamka’s network crashed repeatedly, forcing the hospital to revert to the paper patient-records system that it had abandoned years ago. Lab reports that doctors normally had in hand within 45 minutes took as long as five hours to process. The emergency department diverted traffic for hours during the course of two days. Ultimately, the hospital’s network would have to be completely overhauled.
        First, the CAP team wanted an instant network audit to locate CareGroup’s spanning tree loop. The team needed to examine 25,000 ports on the network. Normally, this is done by querying the ports. But the network was so listless, queries wouldn’t go through.
        As a workaround, they decided to dial in to the core switches by modem. All hands went searching for modems, and they found some old US Robotics 28.8Kbps models buried in a closet. Like musty yearbooks pulled from an attic, they blew the dust off them. They ran them to the core switches around Boston’s Longwood medical area and plugged them in. CAP was in business.
        In time, the chaos gave way to a loosely defined routine, which was slower than normal and far more harried. The pre-IT generation, Sands says, adapted quickly. For the IT generation, himself included, it was an unnerving transition. He was reminded of a short story by the Victorian author E.M. Forster, “The Machine Stops,” about a world that depends upon an uber-computer to sustain human life. Eventually, those who designed the computer die and no one is left who knows how it works.
        He found himself dealing with logistics that had never occurred to him: Where do we get beds for a 100-person crisis team? How do we feed everyone?

        Lesson 1: Treat the network as a utility at your own peril.
        Actions taken:
        1. Retire legacy network gear faster and create overall life cycle management for networking gear.
        2. Demand review and testing of network changes before implementing.
        3. Document all changes, including keeping up-to-date physical and logical network diagrams.
        4. Make network changes only between 2 a.m. and 5 a.m. on weekends.

        Lesson 2: A disaster plan never addresses all the details of a disaster.
        Actions taken:
        1. Plan team logistics such as eating and sleeping arrangements as well as shift assignments.
        2. Communicate realistically—even well-intentioned optimism can lead to frustration in a crisis.
        3. Prepare baseline, “if all else fails” backup, such as modems to query a network and a paper plan.
        4. Focus disaster plans on the network, not just on the integrity of data.”

Anyone who’s spent even a few years in our profession has at least one good horror story. What’s yours? Please share it in comments.

Maximizing IOPS

Edit: Some links no longer work.

Originally posted March 24, 2015 on AIXchange

Recently I listened to a discussion of the differences in input/output operations per second (IOPS) in various workload scenarios. People talked about heavy reads. They talked about heavy writes. They debated whether it was better to use RAID5, RAID6 or RAID10. Things got a little heated.

I came away thinking that I should cover this topic and share some resources with you. For instance, this article provides basic information about physical disks, but also makes some interesting points:

“Published IOPS calculations aren’t the end-all be-all of storage characteristics. Vendors often measure IOPS under only the best conditions, so it’s up to you to verify the information and make sure the solution meets the needs of your environment.

IOPS calculations vary wildly based on the kind of workload being handled. In general, there are three performance categories related to IOPS: random performance, sequential performance, and a combination of the two, which is measured when you assess random and sequential performance at the same time.

Every disk in your storage system has a maximum theoretical IOPS value that is based on a formula. Disk performance — and IOPS — is based on three key factors:

    Rotational speed
    Average latency
    Average seek time

Perhaps the most important IOPS calculation component to understand lies in the realm of the write penalty associated with a number of RAID configurations. With the exception of RAID 0, which is simply an array of disks strung together to create a larger storage pool, RAID configurations rely on the fact that write operations actually result in multiple writes to the array. This characteristic is why different RAID configurations are suitable for different tasks.

For example, for each random write request, RAID 5 requires many disk operations, which has a significant impact on raw IOPS calculations. For general purposes, accept that RAID 5 writes require 4 IOPS per write operation. RAID 6’s higher protection double fault tolerance is even worse in this regard, resulting in an “IO penalty” of 6 operations; in other words, plan on 6 IOPS for each random write operation. For read operations under RAID 5 and RAID 6, an IOPS is an IOPS; there is no negative performance or IOPS impact with read operations. Also, be aware that RAID 1 imposes a 2 to 1 IO penalty.”
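To make the write penalty concrete, consider a hypothetical array of ten 15K drives rated at 210 IOPS each, for 2,100 raw IOPS, serving a 70/30 read/write workload on RAID 5 (write penalty of 4):

    Frontend IOPS = Raw IOPS / (Read% + Write% x Penalty)
                  = 2100 / (0.70 + 0.30 x 4)
                  = 2100 / 1.90
                  = roughly 1105

The array that looks like a 2,100 IOPS resource on paper supports only about half that once RAID 5 write amplification is factored in. Run the same numbers with RAID 10’s penalty of 2 and you get about 1,615 IOPS, which is exactly why these debates get heated.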

Again, that article is focused on physical disks. But I’m also seeing more and more solid state devices (SSDs) being deployed. These charts compare spinning disks to SSDs, and they’re eye-opening. While a 15K SAS drive might see 210 IOPS, an individual consumer-grade SSD might see 5,000 or 20,000 IOPS. Disk subsystems like the IBM FlashSystem 840 show 100 percent random 4K read IOPS of 1.1 million, while a read/write workload might see 775,000 IOPS.

Here’s an interesting tool that lets you configure environments for SSD and physical disk and compare their performance. By moving other variables around, you can model hard drive capacity and estimate workload read/write percentages and drives being used.

What methods do you use when configuring your disk subsystem? Is SSD being deployed in your environment? What RAID levels are you targeting?

Readers Discuss VIOS Installs

Edit: With flash USB drives this is even easier now.

Originally posted March 17, 2015 on AIXchange

Perhaps I wasn’t clear when I explained why NIM is my VIO server installation option of choice. In any event, this reader’s response got me thinking further about the topic:

“I prefer (to) install from the virtual media repository. It is faster with no network issue. Because when you start install always network is not ready yet. And when you are updating if your NIM is not up to date (lpp_source, spot, mksysb, …) you take much time. With Virtual media repository you need just iso image to start install.”

To me, using the virtual media repository in the original scenario (installing a VIO server) presents a chicken-and-egg dilemma. If I could use a virtual media library, the VIO server would already be installed. But because one VIO server can’t be a client of another VIO server, I can’t use a virtual media repository to build a VIO server.

I absolutely agree that a virtual media repository is a great way to go when installing client LPARs. It just doesn’t help with the initial VIO server install.
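For client LPARs, though, setting up the repository is quick work from the padmin shell. A minimal sketch, with hypothetical pool, file and device names:

$ mkrep -sp rootvg -size 20G
$ mkvopt -name aix71.iso -file /home/padmin/aix71.iso -ro
$ mkvdev -fbo -vadapter vhost0 -dev vtopt0
$ loadopt -vtd vtopt0 -disk aix71.iso

That creates the repository, imports an ISO image, creates a file-backed virtual optical device on the client’s vhost adapter, and loads the image; the client LPAR can then boot from its virtual optical drive and install.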

Another reader shared a different scenario: He was installing IBM i and VIOS. Because he didn’t know how to build a NIM server (and he didn’t have access to AIX media anyway), installing from one was out of the question. He tried using the HMC to install a second VIO server, but it wasn’t working. It would start to boot, but then the install would blow up. He had IBM Support specialists look at the logs, but they couldn’t find the problem.

Eventually, he arrived at an inelegant solution. He had a split backplane on his server, and was able to install VIOS1 from physical media with no issues. The DVD was attached to controller 1, with no way to swing it over to controller 2. So he added controller 2 from VIOS2 to his VIOS1 profile, and then restarted the VIOS1 profile. When he booted from the DVD, he opted to install VIOS to a disk that was on controller 2. Once the installation completed, he shut everything down, put controller 2 back into the VIOS2 profile, and booted both VIO servers. VIOS2 would have a few extra logical device definitions that were no longer available, but otherwise, everything worked.

While I don’t expect to run into these issues myself, it is nice to know that different installation options exist. You never know when someone else’s solution might get you out of your own jam.

It’s Time to Snuff Out Commodity Servers

Edit: These days it sounds like people are trying to outlaw them. Some links no longer work.

Originally posted March 10, 2015 on AIXchange

During a recent lunch with customers, the topic of smoking came up. Some were talking about smoking hookahs, some were talking about cigars, and some were talking about cigarettes. One of the guys had recently quit smoking. He credited the Internet, which pointed him to information about e-cigarettes.

He said e-cigarettes helped him curtail his nicotine intake, adding that the flavored e-liquids with a more fruity taste helped him disassociate smoking from the flavor of tobacco. Then, eventually, he just stopped smoking entirely.

Someone said I should write about this, and wondered how I could possibly come up with an analogy that married smoking cessation with some technological topic. It was meant as a joke, but once I gave it some thought, I did make a connection: x86 servers. In the tech world, running Linux on commodity x86 servers is a bad habit that many of us want to break. However, we’ve been doing it for years, and we just can’t seem to stop. Sure, we’ve seen the ads telling us how our lives will be better once we quit, but some of us still can’t find a method that really works.

So has the analogy broken down for you yet? Yeah, me too. Admittedly, better analogies can be made in this case. For instance, when I think about what typically runs on Power systems, I usually imagine huge workloads that require massive amounts of uptime. These critical servers are the backbone of our businesses. Others have compared running Power systems to construction vehicles like giant earth-moving machines. Along those lines, I’ve seen IBM presentations that compared x86 servers and Power systems to bicycles and automobiles.

So would you try to move tons of dirt with a small pickup truck and a shovel? Would you put bicycle tires on a car? Then why do we insist on running the smaller and less critical workloads on slower, less powerful, less robust commodity hardware? Why aren’t we taking advantage of the machines we already have in our data centers, the same machines we trust with our most critical workloads?

We should run Power IFLs, which would enable us to fire up dark cores and memory on our larger machines at an attractive price. We should run Linux on Power with POWER8 scale-out servers with PowerVM or PowerKVM. Using these options, we could wean ourselves off commodity servers, and ultimately dispense with them entirely.

We should be educating ourselves as to why Power is the best choice. Google, Rackspace and others in the OpenPower Foundation are working on data center development around POWER8. Why aren’t you? Didn’t you see this report?

“Newly disclosed scores show Power8 beating Intel’s most powerful server processor, the 18-core Xeon E5-2699v3 (Haswell-EP), on important benchmark tests. Both processors deliver outstanding performance on the SPEC CPU benchmarks, but IBM’s huge advantages in multithreading and memory bandwidth favor Power8 when running larger test suites that more closely reflect real-world enterprise applications.

Overall, the results show that IBM offers a viable high-end alternative to Intel’s market-leading products. Equally important to Big Blue, Power8’s performance is energizing the OpenPower Foundation, an IBM-led alliance that rallies other companies to create a larger hardware and software ecosystem around the processor. IBM is offering Power8 chips to system builders in the merchant semiconductor market and is even licensing the architecture to other processor vendors. So far, the alliance has more than 80 members, including software, system, and semiconductor vendors.

Power8 is IBM’s most powerful microprocessor yet. On the merchant market, it’s available with 8, 10, or 12 CPU cores at maximum clock frequencies of 3.126GHz to 3.758GHz. Compared with its Power7+ predecessor, which is not a merchant product, Power8 offers twice the threads and L2 cache per core, up to 20% more L3 cache, a new L4 cache, up to four times the peak DRAM bandwidth, and twice the per-core SPEC CPU throughput.”

Whether it’s a force of habit or a lack of information, many customers continue to rely upon commodity hardware. Maybe it’s time to take a closer look at what you can do with POWER8.

The Laptop’s Future, Revisited

Edit: I still use my beefy laptop most of the time.

Originally posted March 3, 2015 on AIXchange

A reader had an interesting response to my recent post about the end(?) of desktop and laptop computers. With his permission, I’ll share some of our email exchange:

Hi Rob — Greetings from another dinosaur. For some reason the comments in your article do not work for me. I think you’ve exposed just the tip of the iceberg. Here are a few more reasons why it is too early to declare Desktops/Laptops dead:

1. Battery, battery and battery again.
Smartphones still demand to be charged every day. Watching movies is fine, but using the radio for voice drains the battery in just a few hours. Streamed data transfers make things worse. Using the phone for an hour here and an hour there is fine, but using it eight hours a day ultimately ties one to the power cord.

One could also add that removing the DVD allows you to slam a second battery in, and to replace the main battery with a spare. Using two 9-cells and an UltraBay battery, my TP was able to survive a transatlantic and a couple of connection flights.

2. Long running jobs
Phones/tablets are good for those on the move, but running a long job ties one to a DB server. Yes, the report can run in a VM on a “remote desktop” server, and the tablet can be used as a RDP client. Virtual or not, the desktop is still needed. The phone in such cases is nothing more than a “thin client”; i.e. a dumb keyboard-and-screen device. And a rather cumbersome one, to be honest.

3. Security
Irritated by the size of a laptop an employee puts some confidential data on his/her phone. The data is not encrypted… to save energy. Every security conscious person knows the rest.

Here’s my reply:

I guess a phone guy could argue that he can plug into an external battery to recharge his phone as well, or, with the right model, just swap the phone battery out. However, I still don’t think I’m getting more with a phone compared to what I already have with a laptop. Just comparing the memory, disk space, screen size and processor, I don’t understand why I’d want to go backwards.

I’ve always assumed that I’ll be able to buy a laptop for years to come, but I guess there are those who believe they’re no longer needed. Maybe for a less-demanding user, a phone is perfectly fine. I still want to know where all these displays are that are just waiting for us to hook our phones up to them, especially out on raised floors, etc. I haven’t tried to connect to a serial port on a machine using my phone, but I know it works great on my laptop.

Later in our discussion, the reader adds:

A phone user can have a folding stand for his phone, a folding keyboard, an extra battery, a micro-USB to USB cable, a USB card reader, etc. Have I heard anyone mentioning a folding display? The light-as-a-feather turns out to be cable spaghetti. The “phone and phone only” approach works for those who seldom need anything else. Yeah, one can use any display available. Will he present to the whole crowd around the report for the next shareholders meeting he is currently working on?

The TrackPoint was invented to save the split-second movements between the keyboard and the mouse. Why not replace that with zoom out/scroll/zoom in? I doubt it would be faster.

What about multitasking? Throughout the course of the last 20 years I was always faster than the computer. On my desktop I often open tens of tabs, reading one while the others are loading. Closing the just read one, and instantly reading the next. On my laptop I am limited by the screen size. That is just not possible on a phone, or it is painfully slow. So I am much more productive.

There are many applications the phones and the tablets are well suited for. There are quite a few they are not. The “one size fits all” dream is as elusive as ever.

Incidentally, eWeek and Business2Community both have recent articles opining that laptops and desktops will remain with us for the foreseeable future. So I guess not wanting to run everything from my phone doesn’t, at this point, make me a dinosaur.

A Fun Look Back at Technology

Edit: I still have a landline and a Model M.

Originally posted February 24, 2015 on AIXchange

I like watching many of the old movies aired on TCM (aka, The Movie Channel). In addition to enjoying the stories being told, I just love seeing the clothing and the buildings and the landscapes of bygone eras. While I understand that much of what I’m viewing is actually black and white footage of Hollywood sets and interiors — as opposed to the “real world” that existed back then — I still find it fascinating to hear how people talked and see how they went about their daily lives.

I know I’m far from alone in this belief, but as a fan of old films, I’m convinced that there aren’t many new ideas coming out of modern Hollywood. A lot of premises and situations that hit the big screen 75 or more years ago are still being recycled today.

With this in mind, I want to tell you about some old films that offer a glimpse into the world of technology. A number of great videos are available on YouTube, and as much as our machines have changed over generations, a lot of the information being presented remains relevant.

For instance, check out this 1937 video (produced by Chevrolet) that explains the technology behind an automobile’s differential. Tempting as it may be to dismiss this film based on that dorky intro music alone, there’s valuable information here. The filmmakers do a great job of explaining how engineers were able to solve the problems associated with sending power to two rear wheels. Honestly, I never realized that automobile engines could only deliver power to one wheel prior to this innovation.

In 1965, IBM UK came out with this production that I believe still holds up nicely. Entitled “Man and Computer,” this video reduces a computer to five basic functions: input, memory, calculation, output and a control unit. It’s also fun to see the symbols they used to represent each of these terms — among them, adding machines and typewriters. Everything covered here — how computers use instructions, how those instructions become a program, basic on/off electrical states — is explained simply enough for the non-technical user to understand. (And keep in mind that a half century ago, almost no one used computers.)

This video was so good it had me thinking fondly about the days of punch cards. Luckily for me, I quickly discovered this video about punch cards.

As I said, technology has obviously and immeasurably changed since these old films were produced. Nonetheless, I think even in the computer world, some of our early innovations still have value. Consider computing’s timeline: Once, timeshare machines predominated. Eventually, we got personal computers. When we wanted our disparate computers to be able to communicate with one another, client/server emerged. Then came the public Internet, virtual desktops and the cloud. Oh and the mainframe: Wasn’t that supposed to die 20 years ago? Wasn’t proprietary UNIX supposed to go with it? Yet here we are in 2015 with powerful new mainframes and POWER8 processors.

Of course I love new technologies. New servers running POWER8 are so much more powerful than their predecessors. Naturally, I want to see progress. At the same time, I still use a landline phone. Landlines remain the best option for long-running conference calls. I never worry about poor cellular connections or drained batteries. In addition, I prefer the old-school ThinkPad keyboards to any current keyboard design. And obviously, I still cling to my model M keyboard.

Embracing what’s new is fine, but just because something is brand new, that doesn’t mean we should throw out everything that came before it.

Connecting Your HMC to IBM Support, Revisited

Edit: Still good information.

Originally posted February 17, 2015 on AIXchange

In this August 2014 post I discussed how to connect your HMC to IBM Support.

That post includes a link to a .pdf document that outlines the different connectivity options. However, this IBM technote seems easier to work with:

“The following is a list of ports used by the HMC. The “Inbound application” column identifies ports where the HMC acts as a server that remote client applications connect to. Examples of remote client applications include the browser based remote access and remote 5250 console. Ports used by remote clients need to be enabled in the HMC firewall. They must also be enabled in any firewall that is between a remote client and HMC.

The “Outbound application” column identifies ports where the HMC acts as a client, initiating communications to the port on a remote server. Functions are further classified as Intranet or Internet. Intranet functions are typically limited to communications between the HMC and another HMC, partition or server inside the network. Internet functions require access to the Internet, directly or, in some cases, via a proxy. Because UDP is a directionless protocol, the HMC firewall must be enabled for UDP ports even though the communications may be initiated from the HMC. “Outbound” application ports must be enabled in external firewalls for the function to work. …”

The document then provides a lengthy list of commonly used ports. It also lists some typical configurations:

  • Firewall between the HMC and remote users: 443, 9960, 12443, 2300, 2301, 22
  • Firewall between HMC and other HMCs/partitions: Bi-directional 657 tcp/udp, 9900 udp, 9920
  • Firewall between the HMC and the Internet: Internet VPN 500/4500 udp, outbound 80, 443; outbound FTP
  • Firewall between the HMC and the Managed Server: TCP 443, 30000, 30001
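If you need to verify that one of these ports is actually open end to end, a plain telnet test from the client side works fine (the hostname here is made up):

$ telnet myhmc.example.com 443

If the connection opens, the port is reachable; if it hangs or is refused, check the intervening firewalls and the HMC’s own firewall settings for that network interface.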

If you’re looking for more information on setting up your HMC to call home, here’s another good how-to document that discusses setting up AIX or Linux to use a management console to connect to IBM service and support:

“This procedure contains the complete list of steps that are needed to set up connectivity to service and support. Some of these steps might already have been completed during the initial server setup. If so, you can use this procedure to verify that the steps were completed correctly.

In this information, an Internet connection is defined as access to the Internet from a logical partition, server, or a management console by direct or indirect access. Indirect means that you are behind a network address translation (NAT) firewall. Direct means that you have a globally routable address without an intervening firewall, which would block the ports that are needed for communication to service and support.”

On an unrelated note, if you have an issue with VIO server tasks on the HMC, this document may be helpful:

Error “3003c 2610-366” after apply of Service Pack 1
Technote (troubleshooting)

Problem(Abstract)
The apply of V8R8.1.0 Service Pack 1 or V7R9.1.0 Service Pack 1 may cause some VIOS related tasks to fail. Impacted HMC tasks include Manage PowerVM and Manage partitions task in the new V8R8 “enhanced GUI” as well as the Performance and Capacity Monitor (PCM). External applications using the HMC REST API such as IBM PowerVC are also impacted. The error text will typically include the error message “3003c 2610-366 The action array contains an undefined action name at index 0: VioService.”

Contact IBM support for the circumvention until a fix is available.

Another POWER8 Development Option

Edit: There are easier ways to get access to hardware.

Originally posted February 10, 2015 on AIXchange

If you’re looking to develop software for POWER8 systems but don’t have access to POWER8 hardware, there are options like the virtual loaner program or some kind of test system. You should also be aware of the IBM POWER8 Functional Simulator:

The IBM POWER8 Functional Simulator is a simulation environment developed by IBM. It is designed to provide enough POWER8 processor complex functionality to allow the entire software stack to execute, including loading, booting and running a Fedora 20 BE (Big Endian) kernel image or a Debian LE (Little Endian) kernel image. The intent for this tool is to educate, enable new application development, and to facilitate porting of existing Linux applications to the POWER8 architecture. While the IBM POWER8 Functional Simulator serves as a full instruction set simulator for the POWER8 processor, it may not model all aspects of the IBM Power Systems POWER8 hardware and thus may not exactly reflect the behavior of the POWER8 hardware.

Features
  • POWER8 hardware reference model
  • Models complex SMP effects
  • Architectural modeled areas:
      • Functional units (Load/Store, FXU, FPU, etc.)
      • Pipeline
      • Exceptions and Interrupt handling
      • Address translation
      • Memory and basic cache modeling (SLBs, TLBs, ERATs)
  • Linux and Hypervisor development and debug platform
  • Boots Fedora 20 (BE) and Debian (LE) kernel images
  • TCL command-line interface provides:
      • Custom user initialization scripts
      • Processor state control for debug: Step, Run, Cycle run-to, Stop, etc.
      • Register and Memory R/W interaction

Supported x86_64 host operating systems for running the IBM POWER8 Functional Simulator:
  • Fedora 20
  • Red Hat Enterprise Linux 7.0
  • SUSE 12
  • Ubuntu 14.10

Supported 64-bit Big Endian Linux distributions for booting the IBM POWER8 Functional Simulator:
  • Fedora 20
  • Other distributions may function; however, no testing has been performed

For detailed information, check out the user guide and the command reference guide. I’ll highlight the user guide descriptions of the Simulator’s Linux and Standalone modes:

Linux Mode
In Linux mode, after the simulator is configured and loaded, the simulator boots the Linux operating system on the simulated system. At runtime, the operating system is simulated along with the running programs. The simulated operating system takes care of all the system calls, just as it would in a nonsimulation (real) environment.

Standalone Mode
In standalone mode, the application is loaded without an operating system. Standalone applications are usermode applications that are normally run on an operating system. On a real system, these applications rely on the operating system to perform certain tasks, including loading the program, address translation, and system-call support. In standalone mode, the simulator provides some of this support, allowing applications to run without having to first boot an operating system on the simulator.

Why not download the code and give it a try?

How Do You Handle Host Names?

Edit: Still worth putting thought into.

Originally posted February 3, 2015 on AIXchange

About a month ago this discussion hit the AIX mailing list. I’m posting the thread here to get your feedback.

First, the original question:

“Date:  Tue, 6 Jan 2015 15:41:01 -0600
From:  Russell Adams
Subject: Hostname as short name or FQDN?

Here’s a great question for the brain trust:

Which is actually the correct best practice for host names? The host name as a fully qualified domain name, or a short name?

Supporting documentation required! Thanks.”

And a reply:

“Date:  Tue, 6 Jan 2015 22:25:52 +0000
From:   Davignon, Edward
Subject: Re: Hostname as short name or FQDN?

Russell,

That is a really good question!

According to “man mktcpip”:
    -h HostName
        Sets the name of the host. If using a domain naming system, the domain and any subdomains must be specified. The following is the standard format for setting the host name:
        hostname
        The following is the standard format for setting the host name in a domain naming system:
        hostname.subdomain.subdomain.rootdomain

That being said, many sites use only the short name for the hostname.

Also keep in mind that “/etc/rc.net” sets the node name (uname -n) to the short name and the hostid based on the hostname (actually its IP address as resolved). “/etc/rc.net” sets the hostname based on the name for inet0 in the ODM.

It also brings up the questions of how to best configure names and aliases in “/etc/hosts” and how best to configure these in DNS or other naming services, so they match gethostbyaddr. Some related files are “/etc/resolv.conf”, “/etc/irs.conf”, and “/etc/netsvc.conf”. It has long plagued the community when gethostbyaddr (or gethostbyname) return different responses on the database server and the application server, because /etc/hosts does not match DNS.

I ran into a problem with this once with an early version of the DataGuard installer from Oracle. It got confused, since it did not have the FQDN in the hostname. The Oracle install guide clearly stated that the FQDN was required. This was the only time I have seen this matter.

Since we often cannot control data returned by naming services, it may be better to make sure gethostbyaddr or gethostbyname (i.e. the “host” command) returns the same thing on all of the servers that use the hostname of the server you are configuring.

From “man uname”:  “-n” Displays the name of the node. This may be a name the system is known by to a UUCP communications network.”
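
A quick way to see what Edward describes on your own system is to compare the host name, the node name, and what the resolver returns for them; a minimal check:

    hostname            # the host name, from inet0 in the ODM
    uname -n            # the node name, set to the short name by /etc/rc.net
    host $(hostname)    # what name resolution returns for that name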

Now, Russell’s reply to Edward:

“Date:   Tue, 6 Jan 2015 16:32:29 -0600
From:   Russell Adams
Subject: Re: Hostname as short name or FQDN?

A short hostname includes an empty subdomain.

I always use a short name, and then set domain in /etc/resolv.conf and ensure that the FQDN and short name are in /etc/hosts so reverse lookups fetch it.”
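
To make Russell's convention concrete, here is a minimal sketch; the host name, domain, and addresses are examples:

    chdev -l inet0 -a hostname=server1    # set the short name persistently (plain "hostname server1" is transient)

    # /etc/resolv.conf -- supply the domain so short names resolve
    domain example.com
    nameserver 192.0.2.53

    # /etc/hosts -- carry both the FQDN and the short name on one line
    192.0.2.10   server1.example.com server1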

And finally, another reply from Edward:

“Date:   Wed, 7 Jan 2015 14:58:54 +0000
From:   Davignon, Edward
Subject: Re: Hostname as short name or FQDN?

A related question is how should /etc/hosts and DNS be configured for reverse lookups (i.e. lookups by address)?

Should /etc/hosts have “ipaddr fqdn shortname” or “ipaddr shortname fqdn”? Likewise for DNS, should the reverse lookup return “fqdn” or “shortname” or alternate using round robin?

I pose these as questions, but they are really things to check when troubleshooting applications that rely on name resolution.

DNS can be queried directly using “dig” or “nslookup”.

I have seen numerous misconfigured /etc/hosts files that don’t match DNS for reverse lookups. I have also seen DNS servers return “shortname” instead of “fqdn”. I have seen DNS alternate between “fqdn” and “shortname”. I have also seen DNS return wrong domain names.

Usually the problem I see is /etc/hosts has “ip shortname” or “ip shortname fqdn”, but DNS reverse lookups return “fqdn”. This causes inconsistency between the local server and the remote (app) servers, usually resulting in inconsistency of access controls between app servers, or an app server and its database server. This can also happen when someone changes /etc/netsvc.conf from empty to “hosts=local,bind4”. I use “grep ‘^[^#]’ /etc/netsvc.conf” to check it; it grabs non-blank lines that don’t start with a comment character.”
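
Edward's checks boil down to a handful of commands; the address below is an example:

    host 192.0.2.10                  # reverse lookup through the configured resolver order
    dig -x 192.0.2.10 +short         # query DNS directly, bypassing /etc/hosts
    grep 192.0.2.10 /etc/hosts       # compare against the local hosts file
    grep '^[^#]' /etc/netsvc.conf    # any non-default lookup ordering?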

The discussion died out at this point, but it got me wondering what my readers typically do. I prefer to use a short name for the host, and then make sure /etc/resolv.conf is set up correctly. Would any of you care to make an argument for having an FQDN in your environment?

Security Behind the Firewall

Edit: Still worth considering.

Originally posted January 27, 2015 on AIXchange

Although many of us like to assert that AIX running on Power hardware is a secure operating system, we must be aware of the methods that might be used to compromise the systems we maintain. The AIX user base is smaller than its Windows or Linux counterparts, but that's no reason to assume AIX systems cannot be breached or aren't being targeted. These systems typically run software for hospitals, banks, manufacturers and other industries where uptime and performance are critical and data privacy is essential.

With that in mind, this recently released document, entitled AIX for Penetration Testers, examines the delicate balance between providing user access and maintaining system security:

“AIX is an operating system widely used by banks, insurance companies, power stations and universities. The operating system handles various sensitive or critical information for these services. There is limited public information for penetration testers about AIX hacking, compared to the other common operating systems like Windows or Linux. When testers get user level access in the system, privilege escalation is difficult if the administrators properly installed the security patches. Simple, detailed and effective steps of penetration testing will be presented by analyzing the latest fully patched AIX system. Only shell scripts and the default installed tools are necessary to perform this assessment. The paper proposes some basic methods to do comprehensive local security checks and how to exploit the vulnerabilities.

“The reconnaissance process is the most important task. If an auditor has enough information about the target system, applications and the administrator, it can lead to privilege escalation. After getting user level access on an AIX system, start by finding and exploiting operation issues caused by the administrator.”

Based on information in the document, here are some basic security questions to ask and answer (a quick audit sketch follows the list):

* sudo: is it properly configured?

* umask settings: have they been changed from defaults?

* exploitable SUID/SGID binaries: do they exist on the system?

* the PATH: has it been set up properly?
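
Here is a minimal audit sketch along those lines; the commands are standard AIX, but tailor the paths and checks to your environment:

    sudo -l                                            # what can this user run via sudo?
    lssec -f /etc/security/user -s default -a umask    # default umask for new users
    find / -type f \( -perm -4000 -o -perm -2000 \) -ls 2>/dev/null   # SUID/SGID binaries
    echo $PATH | tr ':' '\n' | grep -E '^\.?$'         # '.' or empty entries in PATH?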

“This methodology defines key local vulnerable points of AIX system. Auditors can make their own vulnerability detection scripts to decrease the time of the investigation based on this methodology. The suggested test steps are information gathering, exploit operation bugs, checking 3rd party software and finally the core system. Valuable information and great ideas are hidden in system guides, developer documentation and man pages. This methodology only describes quick and useable techniques. There are many other vulnerability assessment concepts worth the research, including syscall, signal or file format fuzzing.

“System administrators and auditors can apply useful hardening solutions from the vendor [IBM]. There is a secure implementation of the AIX system called Trusted AIX (IBM, 2014). The mentioned hardening features and guides can increase the local security level of the operating system. Hardening supplemented by professional penetration testing is the proper way to do security.”

Although many organizations like to think that being behind a firewall makes them secure, they forget that trusted users are behind many successful attacks.

What are you doing to protect your systems from unauthorized access and privilege escalation?

vtmenu and Life’s Little Annoyances

Edit: Still good to remember how to disconnect. And still worth asking, how many little annoyances do you just choose to live with?

Originally posted January 20, 2015 on AIXchange

Recently a friend asked me about vtmenu:

“You know when you run vtmenu and you exit using ~. and it disconnects your ssh session to the HMC. Do you remember the keystroke combination which will just return me to the vtmenu or HMC command line? Can’t find it anywhere.”

This has happened to me before: I enter ~. and, instead of going back one level, it completely disconnects me from my ssh session to the HMC. Usually I’m absorbed with some task or problem (in the zone, you could say) and I just return to the HMC console, run vtmenu again, and reconnect to my partition without giving it a thought. I consider it another of life’s minor annoyances, like remembering to run set -o vi or stty erase ^? if your profile hasn’t been set up.
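
For what it's worth, these are the two lines I'd drop into a .profile so they stop being an annoyance:

    set -o vi        # vi-style command-line editing in ksh
    stty erase ^?    # map the backspace key to erase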

Of course, to new users those annoyances can really add up. But surely there is a solution.

Another friend offered this suggestion:

“Use ~~. instead; ~. is also the openssh exit. It doesn’t affect putty and the Windows ssh clients, but if you’re on linux… you quit ssh.”

I found similar advice here. And when I searched the AIXchange archives, I rediscovered this post.

“You can also use the mkvterm -m -p command if you know the machine name and the LPAR name. I find vtmenu to be useful if you do not know that information off the top of your head. If you need to get the machine name, try lssyscfg -r sys, then use lssyscfg -r lpar -m -F to get a list of LPAR names. If someone else is using a console, or you left a console running somewhere else, you can use the rmvterm -m -p command.

In any event, when you are done using a console, you can type ~~. in order to cleanly exit, and you will get a message that says Terminate session? [y/n]. Answer with y and you will go back to the vtmenu screen or to the command line, depending on what method you used to create the console.”
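
Putting the commands from that post together, the HMC side looks something like this; the managed system and LPAR names are examples:

    lssyscfg -r sys -F name                    # list managed system names
    lssyscfg -r lpar -m Server-8286 -F name    # list LPAR names on one of them
    mkvterm -m Server-8286 -p lpar01           # open a console to that LPAR
    # ...work in the console, then type ~~. and answer y to exit cleanly...
    rmvterm -m Server-8286 -p lpar01           # close a console someone left open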

I was pleased to find this solution, but it didn’t work for my friend. And because I couldn’t reproduce the problem, I was unable to offer further help. So the question remains: How do you cleanly disconnect from inside vtmenu? Hopefully my readers have some suggestions.

And in general, how accepting are you of these types of annoyances? Do you shrug them off, or do you put some effort into solving these problems? What about those of you who work on others’ machines where fooling around with .profiles and the like might not be appreciated?

Why I Choose NIM to Install VIOS

Edit: This is still good stuff.

Originally posted January 13, 2015 on AIXchange

In May 2013 I wrote about installing the VIO server using the HMC GUI. This more recent article covers the same topic. At the end of Bart’s post he mentions using the virtual media repository, something I covered here.

While I have used the HMC GUI, I prefer to set up the NIM server and use it to load the VIO server. Bottom line: using NIM is faster than using the HMC GUI.

Of course, there are instances when NIM isn’t an option, such as IBM i environments that run VIO servers. Another example is a data center that’s new or still under construction. Some of my customers fall into this category, and as a result I frequently do system builds in data centers that don’t yet have a NIM server. Often in these situations I don’t even have a network available, because the network guys are simultaneously installing and configuring their gear. Until the network is up and running, the physical machine is all I have.

So what are the options at this point? I could run crossover cables and get the HMC to talk to a network adapter on the system, and then use the HMC to install the VIO server. If there’s physical media, installing from a DVD is an option, although with smaller systems that have split backplanes it can be tricky to use the DVD to install to a second VIO server.

In the past I’ve loaded the VIO server to internal disk, created a virtual media repository and copied the VIO and AIX DVDs over to that repository. Then I use those .iso images to build a NIM server that boots from a free local drive. Once the NIM server is built, I can use it to create my second VIO server via the internal network. At least this allows me to load LPARs across the internal virtual network while I wait for my physical network to be built out. If you’re on a strict timeline (and really, when are we not?), this method can help you be productive as you wait for the network to become available.
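
For reference, the media repository steps from the padmin shell look roughly like this; the sizes, file names, and vhost adapter are examples:

    mkrep -sp rootvg -size 30G                    # create the virtual media repository
    mkvopt -name aix_base.iso -file /home/padmin/aix_base.iso -ro   # import an ISO
    mkvdev -fbo -vadapter vhost0                  # file-backed optical device (creates vtopt0)
    loadopt -disk aix_base.iso -vtd vtopt0        # "insert" the ISO for the client LPAR to boot from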

I’ve also been in situations where the network was running, but VLAN tagging was in place. In such a scenario, I would go into SMS and set up VLAN tagging for my remote IPL to use for booting. However, there’s no option that I know of to define a VLAN within the HMC GUI (if that’s what you’re using to install the VIO server). Sure, this can typically be handled by asking a network admin to temporarily change the VLAN configuration, but of course, some network guys are more amenable to such a request than others. It’s something to be aware of.

Here’s another advantage to using NIM rather than installing from the HMC: I had a customer that wanted to set up a third test VIO server using the HMC GUI. They had a spare fibre card, but no spare network card. This wasn’t an issue, since they could put the VIO server onto an existing internal VLAN and communicate externally via the existing shared Ethernet adapters on their other two VIO servers. The problem was that the GUI only recognizes physical adapters, not virtual ones. Using NIM, we were able to get it to work.

What’s your preferred way to install new systems?