How Does Your Database Rate?

Edit: Do you ever check these?

Originally posted March 1, 2016 on AIXchange

The website db-engines.com rates “database management systems according to their popularity.”

The list has been around for a few years, and as this InfoWorld article notes, “It isn’t forensically precise, nor is it meant to be; it’s intended to give a sense of trends over time.”

The left nav bar contains database rankings by type, including relational databases (IBM’s DB2 is fifth on that list), key-value stores and document stores. You can also see how prevalent open source databases have become.

Here’s how the rankings are calculated:

“The DB-Engines Ranking is a list of database management systems ranked by their current popularity. We measure the popularity of a system by using the following parameters:

* Number of mentions of the system on websites, measured as number of results in search engines queries. At the moment, we use Google and Bing for this measurement. In order to count only relevant results, we are searching for <system name> together with the term database, e.g. “Oracle” and “database”.

* General interest in the system. For this measurement, we use the frequency of searches in Google Trends.

* Frequency of technical discussions about the system. We use the number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.

* Number of job offers, in which the system is mentioned. We use the number of offers on the leading job search engines Indeed and Simply Hired.

* Number of profiles in professional networks, in which the system is mentioned. We use the internationally most popular professional network LinkedIn.

* Relevance in social networks. We count the number of Twitter tweets, in which the system is mentioned.

We calculate the popularity value of a system by standardizing and averaging of the individual parameters. These mathematical transformations are made in a way so that the distance of the individual systems is preserved. That means, when system A has twice as large a value in the DB-Engines Ranking as system B, then it is twice as popular when averaged over the individual evaluation criteria.

The DB-Engines Ranking does not measure the number of installations of the systems, or their use within IT systems. It can be expected, that an increase of the popularity of a system as measured by the DB-Engines Ranking (e.g. in discussions or job offers) precedes a corresponding broad use of the system by a certain time factor. Because of this, the DB-Engines Ranking can act as an early indicator.”

A blog posting went into further detail about the rankings:

“1) The Ranking uses the raw values from several data sources as input. E.g. we count the number of Google and Bing results, the number of open jobs, number of questions on StackOverflow, number of profiles in LinkedIn, number of Twitter tweets and many more.

2) We normalize those raw values for each data source. That is done by dividing them with the average of a selection of the leading systems in each source. That is necessary to eliminate the bias of changing popularity of the sources itself. For example, LinkedIn increases the number of its members every month, and therefore the raw values for most systems increase over time. This increase, however, is rather due to the growing adoption of LinkedIn and not necessarily resulting from an increased popularity of a specific system in LinkedIn. Giving another example: an outage of twitter would reduce the raw values for most of the systems in that month, but obviously has nothing to do with their popularity. For that reason, we are using a selection of the best systems in each data source as a ‘benchmark’.

3) The normalized values are then delinearized, summed up over all data sources (with weighting the sources), re-linearized and scaled. The result is the final score of the system.

The normalization step is the key to understanding the December results: the top three systems in the ranking (Oracle, MySQL and SQL Server) all increased their score. Oracle and MySQL gained formidable 16 and 11 points respectively. As a consequence the benchmark increased, leading to potentially less points for many other systems.

Why are we not using all systems as a benchmark for a data source? Well, we continuously add new systems to our ranking. Those systems typically have a low score (assuming that we are not missing major players). Then, each newly added system would reduce the benchmark and increase the score of most of the other systems.

Conclusion: it is important to understand the score as a relative value which has to be compared to other systems. Only that can guarantee a fair and unbiased score by eliminating influences of the usage of the data sources itself.”
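
A toy calculation makes the normalization step easier to picture. Here’s a minimal ksh sketch of the idea quoted above (the raw counts are made-up numbers, and this is not DB-Engines’ actual code): each system’s raw count is divided by the average raw count of a fixed benchmark set, so growth of the data source itself cancels out.

    #!/usr/bin/ksh
    # Toy normalization: benchmark = average raw count of a fixed set of leading systems.
    # All numbers are invented for illustration.
    benchmark_avg=$(( (1200000 + 900000 + 600000) / 3 ))
    raw=45000    # raw count for the system being scored
    echo "$raw $benchmark_avg" | awk '{printf "normalized score: %.4f\n", $1/$2}'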

While my work almost exclusively involves supporting systems with databases that run on AIX, I still find it worthwhile to learn more about other systems and databases. It’s good to know what else customers are working with.

Running LPM on Selected Partitions

Edit: This is still a valuable tidbit.

Originally posted February 23, 2016 on AIXchange

A while back a customer got word from Oracle that they would be charged for every core on a system that could be used for Live Partition Mobility, even cores that weren’t used by their Oracle database.

“IBM Power VM Live Partition Mobility is not an approved hard partitioning technology. All cores on both the source and destination servers in an environment using IBM Power VM Live Partition Mobility must be licensed.”

The customer found LPM very useful for performing maintenance on their hardware and rebalancing their workloads. They didn’t want to give it up, but naturally, they didn’t want to have to license every core on their machines, either.

To address this problem, they were looking for a way to disable LPM on their Oracle LPARs while still allowing their other LPARs to use LPM. Since LPM is enabled at the frame level with PowerVM Enterprise Edition, they were unsure how this could be done. Could they change the SAN zoning for these LPARs so they would be unable to run LPM? Or should they just bite the bullet and buy some smaller servers and completely segregate their Oracle workload onto frames with no LPM available? (They’re also considering migrating off of Oracle altogether.)

This post caught their eye. It describes how LPM can now be disabled on a per-partition basis:

“HMC V8R8.4.0 introduces a new partition-level attribute to disable Live Partition Migration (LPM) for that partition. The HMC will block LPM for that partition as long as this attribute is enabled. This feature can be used by ISVs to deal with application licensing issues.

Some applications require the user to purchase a license for all systems that could theoretically host the running LPAR. That is, if an LPAR can theoretically be migrated (whether you do so or not) to a total of 4 managed systems, you may be required to purchase a software license for all 4 systems. If you don’t plan on ever migrating the LPAR hosting the application, then this attribute provides an audit-able mechanism to prevent the LPAR from ever being migrated. It should be noted that no IBM software has such licensing requirements.

One benefit of this attribute implementation is it is not dependent on the managed server firmware version so you can use this feature from the HMC enhanced+ GUI, REST API, or CLI on any system the HMC is managing.

One thing to note is that while NovaLink will honor this attribute in a co-managed environment, it does not provide any way to alter the value.

Any change to this attribute is logged as a system event, and can be checked for auditing purposes. A system event will also be logged when the Remote Restart or Simplified Remote Restart capability is set. More specifically, a system event is logged when:

* any of these three attributes are set during the partition creation
* any of these three attributes are modified
* restoring profile data

Users can check system events using the lssvcevents CLI and/or the “View Management Console Events” GUI. Using HMC’s rsyslog support, these system events can also be sent to a remote server on the same network as the HMC.

1) Command to check which partitions managed by this HMC have LPM disabled or enabled

    lssvcevents -t console | grep vclient

time=10/30/2015 10:11:32,text=HSCE2521 UserName hscroot: Enabled partition migration for partition vclient10 with Id 10 on Managed system ct05 with MTMS 8205-E6D*1234567.

time=10/30/2015 10:01:35,text=HSCE2520 UserName hscroot: Disabled partition migration for partition vclient9 with Id 9 on Managed system ct05 with MTMS 8205-E6D*1234567.

2) Command to check which partitions managed by this HMC have LPM disabled

    lssvcevents -t console | grep HSCE2520

time=10/30/2015 10:01:35,text=HSCE2520 UserName hscroot: Disabled partition migration for partition vclient9 with Id 9 on Managed system ct05 with MTMS 8205-E6D*1234567.

3) Command to check which partitions managed by this HMC have LPM disabled or enabled for a particular managed server (1234567)

    lssvcevents -t console | grep "partition migration for partition" | grep 1234567

time=10/30/2015 10:11:32,text=HSCE2521 UserName hscroot: Enabled partition migration for partition vclient10 with Id 10 on Managed system ct05 with MTMS 8205-E6D*1234567.

time=10/30/2015 10:01:35,text=HSCE2520 UserName hscroot: Disabled partition migration for partition vclient9 with Id 9 on Managed system ct05 with MTMS 8205-E6D*1234567.

4) Command to check if a specific partition (vclient9) in a specific managed server (1234567) managed by this HMC has LPM disabled or enabled

    lssvcevents -t console | grep "partition migration for partition vclient9" | grep 1234567

time=10/30/2015 10:01:35,text=HSCE2520 UserName hscroot: Disabled partition migration for partition vclient9 with Id 9 on Managed system ct05 with MTMS 8205-E6D*1234567″
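
For reference, here is a hedged sketch of how you might query and set the partition-level attribute from the HMC command line. The attribute name and syntax are my assumptions based on the HMC V8R8.4.0 CLI, so verify them against your HMC level before relying on them:

    # List the LPM setting for each partition on a managed system (field name assumed):
    lssyscfg -r lpar -m <managed_system> -F name,migration_disabled

    # Disable LPM for a single partition (1 = disabled, 0 = enabled):
    chsyscfg -r lpar -m <managed_system> -i "name=<lpar_name>,migration_disabled=1"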

Have you ever wanted or needed to disable LPM for specific LPARs, either due to an Oracle mandate or some other reason? Let me know in the comments.

Proud to be a (Returning) Champion

Edit: I am still proud to be a Champion.

Originally posted February 16, 2016 on AIXchange

Last fall I wrote about the relaunch of the IBM Champions program. Here’s how I described it back in 2011:

“Apple fanboy” is a moniker that’s sometimes given to those who love Apple products. Along those lines, I guess I’m a “Power fanboy.” I love the platform and the operating systems that run on it. I love the virtualization capabilities, the performance and the reliability. And, as readers of this blog surely know by now, I love telling others about Power Systems servers. I’ve been reading the articles and following the tweets of other Power Champions for some time, which makes me all the more proud to be included in this group and recognized for my efforts.

I was very proud to be chosen as an IBM Champion nearly five years ago, and I’m just as proud to be one of the 17 returning champions — among 34 selections overall — in 2016:

After much reviewing of applications and evaluating contributions, we’re happy to announce the 2016 IBM Champions for IBM Power! We have a total of 34 Champions, with 17 new Champions, and 17 returning Champions.

Congratulations to our 2016 IBM Champions for Power!

These individuals are non-IBMers who evangelize IBM solutions, share their knowledge, and help grow the community of professionals who are focused on IBM Power systems. IBM Champions spend a considerable amount of their own time, energy and resources on community efforts — organizing and leading user group events, answering questions in forums, contributing wiki articles and applications, publishing podcasts, sharing instructional videos, and more!

IBM Champions are also granted access to key IBM business executives and technical leaders to share their opinions, learn about strategic plans, and ask questions. In addition, they may be offered various speaking opportunities that enable them to raise their visibility and broaden their sphere of influence.

Look for an in-depth article on the IBM Champion program and profiles of some of the new IBM Champions for Power in the April issue of IBM Systems magazine.

It’s an honor for me, but I’m also happy for all the deserving recipients. I’ve learned so much from the other 33 people on this list, and I look forward to learning more from them in the future:

Congratulations to:

  •     Torbjorn Appehl
  •     Balazs Babinecz
  •     Aaron Bartell
  •     Alberto C. Blanch
  •     Shawn Bodily
  •     Benoit Creau
  •     Shrirang “Ranga” Deshpande
  •     Waldemar Duszyk
  •     Anthony English
  •     Pat Fleming
  •     Nigel Fortlage
  •     Susan Gantner
  •     David Gibbs
  •     Ron Gordon
  •     Midori Hosomi
  •     Tom Huntington
  •     Terry Keene
  •     Andy Lin
  •     Alan Marblestone
  •     Christian Masse
  •     Pete Massiello
  •     Rob McNelly
  •     Brett Murphy
  •     Jon Paris
  •     Mike Pavlak
  •     Trevor Perry
  •     Steve Pitcher
  •     Billy Schonauer
  •     Brian Smith
  •     David Tansley
  •     Paul Tuohy
  •     Jeroen Van Lommel
  •     Dave Waddell
  •     Charles Wright

Single-User Mode vs. Maintenance Mode

Edit: Avoid a resume-generating event.

Originally posted February 9, 2016 on AIXchange

Recently I was telling a customer about the differences between booting into single-user mode and booting into maintenance mode. If you’re not familiar with these procedures, I recommend either using an existing LPAR or creating a new LPAR and trying them both. But before you do that, check out two valuable IBM support technotes (FAQs) that walk through each method.

This document tells you how to boot AIX to single-user mode to perform maintenance. (Note: You’ll need to know the root password to do this):

In AIX we don’t tend to use single-user mode very much, because many problems require having the rootvg filesystems unmounted for repairs. However, there are some instances when it’s beneficial to use single-user:

  • The system boot hangs due to TCP/IP or NFS configuration issues
  • [To] do work on non-root volume groups
  • To debug problems with entries in /etc/inittab
  • To work on the system without users attempting to log in
  • To work without applications starting up
  • It is easy to unmount /tmp and /var if they need to be checked with fsck or recreated

If the system boots fine from the rootvg, then booting into single-user to repair or perform work has advantages:

  • It boots quicker than Maintenance Mode.
  • You can boot off the normal system rootvg without finding AIX Install media or setting up a NIM SPOT.
  • It allows you to run all commands you would normally have access to in multiuser.
  • Unlike maintenance mode, there is no possibility that hdisks will be renamed.

Procedure
Standalone System (no HMC):
1. Boot system with no media in the CD/DVD drive
2. Wait until you see the options of choosing another boot list, and hear beeps on the console
3. Press 6 to start diagnostics.

System using an HMC:
1. Select the LPAR in the HMC GUI
2. Select Operations -> Activate
3. In the Activate window, click the button that says “Advanced”
4. Change “Boot mode” to “Diagnostic with stored boot list”
5. Click “OK” to save that change, then “OK” again to activate.

More menu options follow, so be sure to read the whole thing.
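
If you would rather script the HMC side of this, a hedged command-line equivalent of the GUI steps above looks something like the sketch below. The -b value is my recollection of the chsysstate boot modes, so confirm it on your HMC first:

    # Activate an LPAR in "diagnostic with stored boot list" mode from the HMC CLI:
    chsysstate -m <managed_system> -r lpar -o on -n <lpar_name> -b ds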

This doc tells you how to boot into maintenance mode on AIX systems to perform maintenance on the rootvg volume group or restore files from an mksysb backup.

There is a variety of media that can be used to boot an AIX system into Maintenance Mode. These consist of:
    1. A non-autoinstall mksysb taken from the same system or another system running the same level of AIX, either on tape or CD/DVD media.
    2. AIX bootable installation media (CD or DVD).
    3. A NIM server with a SPOT configured, and set up to boot this machine for maintenance work.

For certain work it is important to have the exact same level (AIX version, Technology Level, and Service Pack) on the boot media as is installed on disk. In these cases if the system is booted with different levels, the rootvg filesystems and commands may not be available to use.

This portion of the doc is found under the heading, Maintenance Mode Options:

At this point a decision must be made.

Option 1 will attempt to mount the rootvg filesystems and load the ODM from /etc/objrepos. If this works you will have full access to the rootvg filesystems and ODM, so you may run commands such as bosboot, rmlvcopy, syncvg, etc. If the version of AIX you have booted from (either from media or NIM SPOT) is not exactly the same as on disk, this will error and fail to mount the filesystems.

Option 2 will import the rootvg and start an interactive shell before mounting any filesystems. This interactive shell has very few commands available to it. As it has not mounted any filesystems from the rootvg it does not have access to rootvg files or the ODM. Use this option when performing maintenance on the rootvg filesystems themselves, such as fsck, rmlv, or logform.

    1) Access this Volume Group and start a shell
    2) Access this Volume Group and start a shell before mounting filesystems

This portion of the doc is found under the heading, Notes on Maintenance Mode:

1. The terminal type is not usually set up correctly for using the vi editor (in Option 2 only). To set this, type:
    # export TERM=xterm

2. If you mount any rootvg filesystems (either automatically under Option 1 or by hand under Option 2) and change any files you must manually sync the data from filesystem buffer cache to disk. Normally the syncd daemon does this for you every 30 seconds, but no daemons are running in maintenance mode. To sync the data, type:
    # sync; sync; sync

3. Typically there is no network connectivity in maintenance mode, so FTP or telnet are not available.

4. If you are in Option 2 with no filesystems mounted and wish to mount the filesystems and load the ODM you can type:
    # exit

Leaving Maintenance Mode
If you have chosen Option 2 and you have not mounted any filesystems by hand, just shut down the LPAR (via the HMC) or power off a standalone server. If you are ready to boot AIX to multiuser then activate the LPAR or if a standalone server power it on via the front panel.

If you have chosen Option 1 type these commands to reboot the system:
    # sync; sync; sync; reboot

Again, read both documents in their entirety to learn more. And because preparation is always worthwhile, I’ll add that it’s always a good time to verify that you have good mksysb images that you could use if needed. 
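
As a hedged example, you can sanity-check an existing mksysb image right from the command line. The path here is made up and the flags are from memory, so confirm them on your AIX level before scripting around them:

    # Show the volume group and backup information recorded in the image:
    lsmksysb -lf /export/mksysb/hostname.mksysb

    # List the files contained in the backup (lsmksysb is linked to listvgbackup):
    lsmksysb -f /export/mksysb/hostname.mksysb | more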

Speaking of the importance of preparation, not long ago I heard from someone whose server failed to boot up after a recent power outage. They were booting from local disks and didn’t have their rootvg mirrored. They did not have backups. They did have a very bad day. Some folks refer to this type of situation as an RGE, or a resume-generating event. With minimal effort now, you can avoid the same fate.

New Solutions to the Age-Old Problem of Memory Errors

Edit: Yet another reason to look at Enterprise class hardware.

Originally posted February 2, 2016 on AIXchange

This article made the rounds on Twitter awhile ago. It’s worth your time if you haven’t read it:

Not long after the first personal computers started entering people’s homes, Intel fell victim to a nasty kind of memory error. The company, which had commercialized the very first dynamic random-access memory (DRAM) chip in 1971 with a 1,024-bit device, was continuing to increase data densities. A few years later, Intel’s then cutting-edge 16-kilobit DRAM chips were sometimes storing bits differently from the way they were written. Indeed, they were making these mistakes at an alarmingly high rate. The cause was ultimately traced to the ceramic packaging for these DRAM devices. Trace amounts of radioactive material that had gotten into the chip packaging were emitting alpha particles and corrupting the data.

Once uncovered, this problem was easy enough to fix. But DRAM errors haven’t disappeared. As a computer user, you’re probably familiar with what can result: the infamous blue screen of death. In the middle of an important project, your machine crashes or applications grind to a halt. While there can be many reasons for such annoying glitches—including program bugs, clashing software packages, and malware—DRAM errors can also be the culprit.

For personal-computer users, such episodes are mostly just an annoyance. But for large-scale commercial operators, reliability issues are becoming the limiting factor in the creation and design of their systems.

Most consumer-grade computers offer no protection against such problems, but servers typically use what is called an error-correcting code (ECC) in their DRAM. The basic strategy is that by storing more bits than are needed to hold the data, the chip can detect and possibly even correct memory errors, as long as not too many bits are flipped simultaneously. But errors that are too severe can still cause machines to crash.

There was some unquestionably good news. For one, high temperatures don’t degrade memory as much as people had thought. This is valuable to know: By letting machines run somewhat hotter than usual, big data centers can save on cooling costs and also cut down on associated carbon emissions.

One of the most important things we discovered was that a small minority of the machines caused a large majority of the errors. That is, the errors tended to hit the same memory modules time and again.

The bad news is that hard errors are permanent. The good news is that they are easy to work around. If errors take place repeatedly in the same memory address, you can just blacklist that address. And you can do that well before the computer crashes.

When you consider all the effort that goes into making today’s servers even more reliable, I think it’s even more impressive to consider how IBM has designed Power Systems. From the E870/E880 Redbook:

2.3.6 Memory Error Correction and Recovery
The memory has error detection and correction circuitry designed such that the failure of any one specific memory module within an ECC word can be corrected without any other fault.
In addition, a spare DRAM per rank on each memory port provides for dynamic DRAM device replacement during runtime operation. Also, dynamic lane sparing on the DMI link allows for repair of a faulty data lane.

Other memory protection features include retry capabilities for certain faults detected at both the memory controller and the memory buffer.

Memory is also periodically scrubbed to allow for soft errors to be corrected and for solid single-cell errors reported to the hypervisor, which supports operating system deallocation of a page associated with a hard single-cell fault.

2.3.7 Special Uncorrectable Error handling
Special Uncorrectable Error (SUE) handling prevents an uncorrectable error in memory or cache from immediately causing the system to terminate. Rather, the system tags the data and determines whether it will ever be used again. If the error is irrelevant, it does not force a checkstop. If the data is used, termination can be limited to the program/kernel or hypervisor owning the data, or freeze of the I/O adapters controlled by an I/O hub controller if data is to be transferred to an I/O device.

4.3.10 Memory protection
The memory buffer chip is made by the same 22 nm technology that is used to make the POWER8 processor chip, and the memory buffer chip incorporates the same features in the technology to avoid soft errors. It implements a try again for many internally detected faults. This function complements a replay buffer in the memory controller in the processor, which also handles internally detected soft errors.

The bus between a processor memory controller and a DIMM uses CRC error detection that is coupled with the ability to try soft errors again. The bus features dynamic recalibration capabilities plus a spare data lane that can be substituted for a failing bus lane through the recalibration process. The buffer module implements an integrated L4 cache using eDRAM technology (with soft error hardening) and persistent error handling features.

For each such port, there are eight DRAM modules worth of data (64 bits) plus another DRAM module’s worth of error correction and other such data. There is also a spare DRAM module for each port that can be substituted for a failing port.

Two ports are combined into an ECC word and supply 128 bits of data. The ECC that is deployed can correct the result of an entire DRAM module that is faulty. This is also known as Chipkill correction. Then, it can correct at least an additional bit within the ECC word.

The additional spare DRAM modules are used so that when a DIMM experiences a Chipkill event within the DRAM modules under a port, the spare DRAM module can be substituted for a failing module, avoiding the need to replace the DIMM for a single Chipkill event.

Depending on how DRAM modules fail, it might be possible to tolerate up to four DRAM modules failing on a single DIMM without needing to replace the DIMM, and then still correct an additional DRAM module that is failing within the DIMM.

In addition to the protection that is provided by the ECC and sparing capabilities, the memory subsystem also implements scrubbing of memory to identify and correct single bit soft-errors. Hypervisors are informed of incidents of single-cell persistent (hard) faults for deallocation of associated pages. However, because of the ECC and sparing capabilities that are used, such memory page deallocation is not relied upon for repair of faulty hardware.

Finally, should an uncorrectable error in data be encountered, the memory that is impacted is marked with a special uncorrectable error code and handled as described for cache uncorrectable errors.

The Reliability, Availability, and Serviceability characteristics built into Power hardware (not just the memory subsystem) are just one of the many reasons I enjoy working on these systems.

Thoughts on SAP HANA’s Availability on Power Systems

Edit: Still the best place to run it.

Originally posted January 26, 2016 on AIXchange

I assume you’ve heard by now that SAP HANA is available on IBM Power Systems.

With this release, SAP HANA on IBM Power Systems is supported for customers running SAP Business Warehouse on IBM Power Systems. This solution is available on SUSE Linux, for configurations initially scaling-up to 3TB. This is available within the Tailored Datacenter Integration (TDI) model, which will enable customers to leverage their existing investments in infrastructure.

A large pharmaceutical company had a 100X improvement in query performance, and an 88% reduction in ETL execution time compared to what they had running the same workload on their legacy database. In another instance, a large energy provider saw a 95% reduction in query response times compared to running those same queries against a legacy database.

There was also this interesting post from Alfred Freudenberger, North America Power Systems sales executive, IBM Power Systems for SAP Environments. Some highlights:

In November, 2015, SAP unleashed a large assortment of support for HoP. First, they released a first of a kind support for running more than 1 production instance using virtualization on a system. For those that don’t recall, SAP limits systems running HANA in production on VMware to one, count that as 1, total VMs on the entire system.

SAP took the next step and increased the memory per core ratio on high end systems; i.e. the E870 and E880, to 50GB/core for BW workloads thereby increasing the total memory supported in a scale-up configuration to 4.8TB.

What does this mean for SAP customers? It means that the long wait is over. Finally, a robust, reliable, scalable and flexible platform is available to support a wide variety of HANA environments, especially those considered to be mission critical. Those customers that were waiting for a bet-your-business solution need wait no more.

Here’s another perspective:

In that blog, he did an excellent job of explaining how technical enhancements at a processor and memory subsystem level can result in dramatic improvement in the way that HANA operates. Now, I know what you are thinking; he likes what Dr. Plattner has to say about a competitor’s technology? Strange as it may seem, yes … in that he has pointed out a number of relevant features that, as good as Haswell-EX might be, POWER8 surpassed, even before Haswell-EX was announced.

All of these technical features and discussion are quite interesting to us propeller heads. Most business people, on the other hand, would probably prefer to discuss how to improve HANA operational characteristics, deliver flexibility to respond to changing business demands and meet end user SLAs including response time and continuous availability. This is where POWER8 really shines. With PowerVM at its core, Power Systems can be tailored to deliver capacity for HANA production to ensure consistent response time and peak load capacity during high demand times and allow other applications and partitions to utilize capacity unused by the HANA production partition. It can easily mix production with other production and non-production partitions. It features the ability to utilize shared network and SAN resources, if desired, to reduce data center cost and complexity. POWER8 delivers unmatched reliability by default, not as an option or a tradeoff against performance.

By comparison, SAP has only one certified benchmark for which HANA systems have been utilized called BW-EML. Haswell-EX cpus were used in the 2B row Dell PowerEdge 930 benchmark and delivered an impressive 172,450 Ad-hoc Navigation Steps/Hr. This is impressive in that it surpassed the previous IvyBridge based benchmark of 137,010 Ad-hoc Navigation Steps/Hr on the Dell PowerEdge R920, an increase of almost 26% which would normally be impressive if it weren’t for the fact that the system includes 20% more cores and 50% more memory. By comparison, POWER8 delivered 192,750 Ad-hoc Navigation Steps/Hr with the IBM Power Enterprise System 870 or 12% more performance with 45% fewer cores and 33% less memory resulting in twice the performance per core.

Finally, check this out:

Take for example, the SAP BW Enhanced Mixed Load (BW-EML) Standard Application Benchmark on four-socket servers. This benchmark has documented that POWER8 cores out-perform Haswell EX cores by two times while running SAP HANA analytics workloads.

That’s not even the best part. I have been impressed with the capability of the POWER8 line to scale to much higher core counts. The scaling ability of POWER8-based servers is key to both enabling workload consolidation and removing the need to break large datasets across multiple nodes which would otherwise negatively impact the latency of queries.

Of course, the performance and scaling attributes of Power Systems are only part of the story. The enterprise-grade resiliency and flexible capacity features that Power Systems are known for become increasingly important to clients as they deploy in-memory analytics capabilities. SAP HANA availability across the entire POWER8 product line allows our existing clients to quickly and easily extend these benefits to HANA by simply allocating additional capacity on their infrastructure.

We continue to collaborate and partner with SAP to optimize and tune in-memory database performance for Power Systems, including further leveraging of SIMD instructions, transactional memory, and other acceleration features in POWER. With the successes we’ve seen in running these challenging in-memory workloads on our enterprise-class servers, we’re off to a great start, one that clients are sure to find highly beneficial while balancing the explosion of data in their day to day business operations.

If your enterprise is considering deploying SAP HANA, have you thought about running it on Power Systems?

Another HMC Goody: myHMC Mobile

Edit: Does anyone use this?

Originally posted January 19, 2016 on AIXchange

After trying out the HMC virtual appliance (vHMC), I wanted to examine the myHMC mobile application. The app, which came out last summer, is designed to allow you to manage HMC devices from your phone.

For more, watch this video, and read Appendix A from this IBM Redbook.

myHMC is an Android or iOS application that lets you connect to and monitor managed objects on your Power systems Hardware Management Console (HMC). Monitoring includes the status of your Managed Systems, Logical Partitions/Virtual Machines and VIO servers. The application also allows you to view Resource Groups, Serviceable Events and Performance Data.

Since I have an Android phone, I downloaded the app from Google Play. Apple users can download an iOS version from iTunes.

Once you install myHMC, there is a built-in demo HMC inside the app for you to play with, though if you have the proper network connectivity and user ID and password information, you should be able to connect it to your own HMC.

I went ahead and connected the myHMC app on my phone to the vHMC running in VMware on my local network — although obviously I’d need to VPN in or have my mobile device connected to a corporate network in order to use it there. It’s a minimal interface, but it does provide a useful read-only view of HMC information.

You can see your managed systems, VIO servers, logical partitions and resource groups in the Resources section of the app. The errors and notifications section displays your serviceable events, and allows you to drill down for details about events and errors. The more information section provides the HMC serial number, machine type, HMC code version and build level.

A dashboard view displays the HMCs that are online, the attention LEDs, the events and the status of managed systems — including whether they’re powered on, operating, initializing or in standby mode. In the logical partitions view, the options are not-activated, running, suspended, open firmware and migrating running.

Under settings, you can find information such as how to use this app, release notes and open source licenses. How to use this app brings you to six pages of information, including screen shots that help you understand how to navigate (although it’s fairly self-explanatory once you try it out). You’re told you can switch between your HMCs and your dashboard view. Use the + key to add an HMC (obviously you’d first need to set up the HMC to allow remote connections and remote operation just like you would normally set up to allow remote access to your HMC). Individual HMCs can be edited or deleted by holding the corresponding icon, while application settings are available from the overflow menu icon. To send the developers feedback about the application, simply shake your phone while the app is running.

Again, all of the information in the application is read-only. At least I didn’t see any way to modify anything on my HMC from the application. Perhaps you found something that I’ve overlooked? Be sure to let me know what you find as you use the app.

What do you think? Do you have the connectivity you need into your data center to make this application useful to you?

Testing Out the New vHMC

Edit: Do you use this in your environment?

Originally posted January 12, 2016 on AIXchange

Have you ever wished you could run HMC code on your laptop? Sure, there are unsupported work-arounds, but the new HMC virtual appliance (vHMC) makes this task much simpler to accomplish, and it’s supported by IBM. Read all about it in section 3.2 of this Redbook.

Although IBM designed the vHMC solution for use in data centers — either as a backup to an existing physical HMC or by itself as a primary HMC solution — I wanted to know if it would actually run on a laptop. In truth, I just had to know.

Before getting into the install process, a bit about the vHMC itself. It allows you to manage Power servers from your existing VMware environment. In addition, some high availability solutions can be set up around your vHMC VM in existing VMware environments. This is especially useful for smaller customers that want to manage one or two smaller Power servers (located either on-site or remotely) without the need for dedicated HMC hardware.

The FAQs and general info that follows can be found in this document. I recommend reading the entire thing.

Support for vHMC firmware, including how-to and usage, is handled by IBM software support similar to the hardware appliance. When contacting IBM support for vHMC issues specify “software support” (not hardware) and reference the vHMC product identification number (PID: 5765-HMV).

How-to, install, and configuration support for the underlying virtualization manager is not included in this offering. IBM has separate support offerings for most common hypervisors which can be purchased if desired.

Q: How can I tell if it’s a vHMC?
A: To determine if the HMC is a virtual machine image or hardware appliance, view the HMC model and type. If the machine type and model is in the format of “Vxxx-mmm,” then it is a virtual HMC.

From command line (CLI) use the lshmc -v command and check the *TM field for a model starting with “V” and/or the presence of the *UVMID fields… .

Q: Are existing HMC customers entitled to vHMC?
A: No. vHMC is a separate offering and must be purchased separately. There is no conversion and no upgrade offering at this time.

Q: Are there any restrictions related to on-site warranty support for managed servers?
A: Restrictions are similar to the hardware appliance. You must supply a workstation or virtual console session located within 8 meters (25 feet) of the managed system. The workstation must have browser and command line access to the HMC. This setup allows service personnel access to the HMC. You should supply a method to transfer service related files (dumps, firmware, logs, etc) to and from the HMC and IBM service. If removable media is needed to perform a service action, you must configure the virtual media assignment through the virtualization manager or provide the media access and file transfer from another host that has network access to HMC.

Q: Can the vHMC be hosted on IBM POWER servers?
A: No, the current offering is only supported on Intel hardware. See release notes for the requirements.

Q: Is DHCP/private network supported?
A: Automatic configuration of a private DHCP network interface at install time by the activation engine is not supported. Manually configuring a private DHCP network using the HMC GUI/CLI is supported the same as with the hardware appliance. Note that a private DHCP network requires an isolated network to the managed server FSPs. Using the hypervisor to configure an isolated private network is outside the scope of vHMC. As with the hardware appliance, vHMC does not support VLAN tagged packets.

As noted in these installation instructions, the vHMC supports the kernel-based virtual machine (KVM) and VMware virtualization hypervisors. Here are the minimum requirements for running it:

  • 8 GB of memory
  • 4 processors
  • 1 network interface (maximum of 4 allowed)
  • 160 GB of disk space (recommended: 700 GB to get adequate performance and capacity monitoring (PCM) data)

Note: The processor on the systems that host the HMC virtual appliance must be either an Intel VT-x or an AMD-V hardware virtualization-enabled processor.

In my test environment I saw tolerable performance with 1 CPU and 4G of memory. Of course I wouldn’t recommend running it that way in production.

Also remember: The vHMC itself isn’t monitored. IBM has no visibility into all the different types of hardware on which this code could run:

Callhome for serviceable events with a failing MTMS of the virtual HMC appliance are not called home to IBM hardware service. The virtual HMC appliance is a software only offering with no associated hardware as provided in the HMC hardware appliance. Serviceable events reported against the vHMC appliance can be reported manually to IBM software support by phone or the IBM service web site.

Callhome for serviceable events on the managed servers and partitions, which will have “Failing MTMS” of the server, works the same on the virtual HMC as on the hardware appliance.

I had a copy of VMware workstation on my laptop, and I first tried the KVM version of the vHMC code inside of a Linux VM running KVM. First, I had to get the code. After confirming that I had entitlement, I went to the ESS website, clicked on entitled software, and then selected software downloads. When prompted for my operating system, I selected other. This brought up the 5765-HMV Power HMC Virtual Software Appliance. After selecting that, I was able to choose:

    tar.gz Download README
    TGZ, ESD – Virtual HMC V8.8.4 for VMware 11/2015
    TGZ, ESD – Virtual HMC V8.8.4 for KVM 11/2015

The actual files that were downloaded were named:

    README_for_tar_gz_Downloads_3-2007.tar.gz
    ESD_-_Virtual_HMC_V8.8.4_for_VMware_112015.tar.gz
    ESD_-_Virtual_HMC_V8.8.4_for_KVM_112015.tar.gz

After unzipping and untarring the files, I fiddled around with nested VMs to see if I could get the KVM vHMC code working in a Linux VM that was running inside of VMware:

Most hypervisors require hardware-assisted virtualization (HV). VMware products require hardware-assisted virtualization for 64-bit guests on Intel hardware. When running as a guest hypervisor, VMware products also require hardware-assisted virtualization for 64-bit guests on AMD hardware. The hardware-assisted virtualization features of the physical CPU are not typically available in a VM, because most hypervisors (from VMware or others) do not virtualize HV. However, Workstation 8, Player 4, Fusion 4, and ESXi 5.0 (or later) offer virtualized HV, so that you can run guest hypervisors which require hardware-assisted virtualization.

With virtualized HV enabled for the outer guest, you should be able to run any guest hypervisor that requires hardware-assisted virtualization. In particular, this means that you will be able to run 64-bit nested guests under VMware guest hypervisors.

Although I checked the correct box to enable nested virtualization, in my early tests the performance was too sluggish to get much done. I got vHMC to boot inside of KVM from within VMware, but it was far simpler to just run it in KVM or VMware natively.

I still wanted to get the KVM version to work, so I loaded Redhat Linux Enterprise Edition on an old standalone desktop machine. I copied the KVM file over to the Linux machine, and clicked on create a new virtual machine. The directions came right from the Redbook cited at the beginning. I selected import an existing disk image, left my OS as generic, set my memory and CPU settings, gave it a name, and clicked on finish. It came right up just as expected. Then I switched over and concentrated on my VMware instance.
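
As an aside on the KVM step: if you prefer the command line to virt-manager, a virt-install one-liner should handle the same import. This is only a sketch; the guest name, image path and bridge name are assumptions, so adjust them to whatever the KVM tarball actually extracted on your system:

    # Import the extracted vHMC disk image as a new KVM guest (names and paths assumed):
    virt-install --name vHMC --ram 8192 --vcpus 4 --import \
      --disk path=/var/lib/libvirt/images/vHMC.qcow2 \
      --network bridge=br0 --graphics vnc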

To get vHMC to deploy in VMware, I clicked on one of the files (vHMC.ova) that was uncompressed from the VMware tarball I previously downloaded. I was prompted for the name of my new virtual machine and the path where it would live on the disk. I then clicked on import.

From there, everything else happened automagically. It was set to use thin provisioned disk, which only took up a little space on my machine, about 8G or so. By selecting “power on this virtual machine,” my vHMC came right up.

I did the next steps on both the KVM and VMware versions. I was first prompted to change my locale, so I told it to exit and not prompt again. I did likewise when prompted about changing my keyboard. Finally, I was asked to accept the license agreement. In short, everything worked pretty much as it would in a fresh HMC install on standalone hardware.

After accepting the license, the guided setup prompts came up. I skipped over that and the Callhome setup, since neither is necessary for my sandbox environment. At least that’s what I thought. It turns out, though, that the guided setup is where you create a password for your hscroot user ID. Having not done so, I couldn’t log on. So I rebooted the VM and tried again, this time running the guided setup. I chose my timezone, set up a password for hscroot and root, and skipped over the option to create other users.

For my networking I used an open network with an address of 127.0.0.1, and skipped over firewall settings. I told it I didn’t want to configure another adapter. In addition, I didn’t change the hostname or the domain name. I didn’t put in a gateway address or a gateway device. I told it I didn’t want to use DNS, and I skipped over setting up the SMTP server. Then I clicked on finish. After closing the wizard, it reset the GUI and allowed me to log in as hscroot. (Each time you log in you get a “tip of the day,” which is another thing I skipped.) Finally, I looked at my HMC version and indeed saw I was running 8.8.4.0, on a model type Vxxx-mmm.

My sandbox performance is pretty good, especially considering that this is an undersized VM that competes for resources with other VMs. On top of that, my test machine only has 8G of physical RAM installed. Obviously in a lab environment performance isn’t really a priority. In a real environment this code would be as snappy as you’d find on dedicated hardware.

Another nice thing is that suspend and resume function the same as they do in other VMs you might be used to on KVM or VMware. It’s a simple matter to get it out of the way to free up resources; then when you’re ready to get back to it, you pick right up where you left off.

Finally, I appreciate that the process of installing fixes seems identical to what we’re used to with standalone HMCs. Since my VMware internal switch was set up to give my vHMC an address, I obtained one using DHCP by going into my HMC network adapter settings. I changed that setting from a fixed IP to DHCP and got it on the network. Then I was able to go into updates by selecting update HMC. IBM Fix Central didn’t show any vHMC updates, but there were regular HMC updates (MH01560 and MH01588), so I tested those on my sandbox server. Everything worked fine.

There remain many sound reasons to have an isolated standalone management machine serve as your infrastructure point of control. For starters, I believe that the KISS concept still has its advantages when it comes to managing critical hardware. However, the vHMC does offer another option for managing our machines, and I’m sure that adoption will grow as users get more comfortable with it. Testing it out in your environment will be the first step.

Can you see yourself using this solution in place of a dedicated HMC? Please share your thoughts in comments, along with any requests for other tests I can run with the vHMC.

Simon Scripts

Edit: Still good stuff.

Originally posted January 5, 2016 on AIXchange

For years I’ve been asking you to send me scripts. Sharing your scripting abilities benefits us all. We can use them as is, or as a starting point to create scripts that could help others.

Sometimes I find scripts — take these, for instance (here, here and here). Regardless of their origin, I share them when I get permission from their authors. It’s a win-win.

With that in mind, here’s a script that Simon Taylor recently sent me:

Simple script to check/extend dump device. If I wanted to get fancy, I would cron it, read errpt output, and limit the size of dumpdev based on free space in rootvg. But then it wouldn’t be simple.

    #!/usr/bin/ksh
    # Identify the primary dump device and compare its size (in PPs) with the
    # estimate from sysdumpdev -e; extend it when called with the arg "extend".
    primary=`sysdumpdev -l | awk '/primary/ {print $2}'|cut -d / -f3`
    echo primary is $primary
    ppsz=`lsvg rootvg|awk '/SIZE:/ {print $6}'`
    estimated=`sysdumpdev -e|awk '{print $NF/1024/1024/'$ppsz'}'`
    let estimated=estimated+1
    real=`lslv $primary|awk '/^LPs:/ {print $2}'`
    echo estimated size is $estimated, real is $real
    if [ $real -lt $estimated ] ; then
            let extend=estimated-real
            let extend+=1
            if [ "$1" = extend ] ; then
                    extendlv $primary $extend && echo "extended dump device"
            else
                    echo 'call with arg "extend" to extend dump device'
            fi
    fi

I asked Simon if he had other scripts he could share, and he provided several. They’re packaged in this tarball. What follows is from his README file, which is included in the .tar file below.

A collection of (hopefully) useful scripts organized in a couple of directories.

All the scripts that initiate communication with remote hosts assume that they run from an account that has a root ssh key on the remote host.

scripts directory

 menu 

– a simple menu program written in perl in the late 90’s as an antidote to compiled menu programs with licenses and incomprehensible menu
formats. Documentation in the script and menu file.
Call with a menu name otherwise script looks for
         $(dirname $0)../menus/main.mnu

 qdump

– korn shell script to display the difference in disk blocks between
    the size of the system dump device and the output of “sysdumpdev -e”.
    Call with arg “extend” to extend the dump device to estimated size + 1

menus directory
main.mnu – a sample menu for the menu script. Will display this readme and run the scripts. Does not require the .mnu suffix.
Try scripts/menu
——————————————————————————–

doc_vio_disks directory

 doc_vio_disks – maps vio server and client disks. Reports on misconfiguration.
 Call with arg vio_server_name.
 Script will find the partner vio server, the managing hmc and the clients.

  Assumptions:
  1. User account running the command has root keys on lpars and vio servers
  2. User account running the command has hscroot keys on managing hmcs
  3. Frame has two vio servers and both serve vscsi disks to clients

 support_scripts – subfolder containing scripts used by doc_vio_disks
   chk.disks – collects disk info : name, lun, storage serial, size, vg
       display_vio_diskmap – displays output
   disp_vdisk – collects and formats client disk data
   knock.pl – general purpose script to test connection to ip/socket pair
   get_vioserver_data – collects and formats selected prtconf type data
   get_device_map – collect hmc device map

 doc_vio_disks consists of all these bits because originally it was meant to
 answer the question “What’s the next free disk on the vio servers?”. It started as a means to parse data collected manually from the vio servers and the hmc.  I normally display output in two windows side by side to make errors/problems show up.

——————————————————————————–

pmksysb
 pmksysb – script to pull a mksysb from a server using ssh and a fifo
 pmksysb_client – pushed to the client to run the mksysb

 Written to avoid the annoyance of trying to manage nfs mounts and distributed cron jobs. Can be controlled from the central server using a simple script and file containing “day of month” “local target directory” “server name”.
 The simple script invokes pmksysb on “day of month”, writes “server name” mksysb data to local “target directory”.

 This is the help displayed if pmksysb is run without arguments:
pmksysb -c client_hostname
        [ -d local_directory (default /export/mksysb) ]
        [ -f local_file (default client.mksysb) ]
        [ -v ] verify the local mksysb output
        [ -o ] overwrite existing local_file
        [ -n ] skip mkszfile on client
        [ -s ] skip mkvgdata on client
        [ -z ] gzip the completed mksysb
        [ -k kill_time (default 1 hour) ]
        [ -m mail_file ] merge mail for transmission to someone

Take a mksysb of a remote client using a named fifo.
Also runs savewpar on wpar clients.
The -k flag is meant to be used to prevent the client task from being
killed within 1 hour (done to prevent orphan processes on slow systems)

Behaviour is further modified by optional environment variables
RUNAS – local user with root ssh key on remote system (default root)
LOCAL_BIN – local location of pmksysb_client (default /usr/local/bin)
MKSYSB_DIR – local directory which will receive mksysbs (default /export/mksysb)

——————————————————————————–

where directory
 where_is – looks for a server in the file written by hierarchy.pl and displays
            where the server is.

 where_is a_server  a_server found on a_frame, hmc is a_hmc, vio is a_vio another_vio

 hierarchy.pl

– collects cec and lpar data from hmcs writes a list containing hmc, cec, lpar info

 Sample crontab entry for hierarchy.pl
# crontab entry – midday, because most systems will be up.
# This is why where_is fools with fuser on the data file in case it is
# still being written

0 12 * * * /some/location/hierarchy.pl /some/location/full_hmc.names

# /some/location/full_hmc.names just contains hmc names, one per line

st7392020@gmail.com
——————————————————————————–“

As always, feel free to send me your scripts and I’ll happily share them.

The Important Work of Certification Test Writing

Edit: Some links no longer work.

Originally posted December 22, 2015 on AIXchange

I’ve once again been working with teams that are updating various certification tests. I enjoy the interaction with tech pros from around the world as we devise test questions and answers.

As I wrote in my previous post on this topic:

The first thing I noticed was the strict confidentiality required for all team members. We were not to discuss questions or answers with anyone outside of the team for any reason. The last thing we want to do is allow a test taker to get access to the questions and answers. If people are able to cheat their way through an exam, it lessens the value of the certification for those who pass the exam legitimately.

Detecting cheating, or “non-independent test taking” (NITT), has become an even bigger deal since the time I wrote those words:

NITT is any circumstance when an exam is not taken independent of all external influence or assistance.

Non-Independent Test Taking (NITT) is a breach of IBM Test Security and is a serious violation of IBM Professional Certification Testing Practices

If you have taken an IBM certification exam, and it is determined you did NOT test independently: Your certification (if awarded) will be revoked; resulting in the loss of your certified status. You will be banned from testing, and will not be allowed to take any IBM test.

BEFORE TESTING

DON’T:
1. Use any unauthorized study guides, or other materials, that include the actual certification test questions.
2. Have someone else take the exam for you.

DURING TESTING

DON’T:
1. Talk to others who are testing, or look at their screen
2. Use written notes, published materials, testing aids, or unauthorized material.

AFTER TESTING

DON’T:
1. Disclose any test content.
2. Reproduce the test.
3. Take any action that would result in providing assistance or an unfair advantage to others.

Detecting Non-Independent Test Taking

IBM (and many IT certification programs) has devised methods to detect the use of resources containing IBM certification test questions. Through complex data forensics, we can identify a NITT violation. The forensic analysis is based on a variety of factors and different elements of the testing results. (IBM does not rely on any single piece of data.)

It is important that in addition to a standard review of the overall test results, there are multiple aspects of response patterns that are analyzed. The psychometrics of the test performance is evaluated. And IBM also includes independent statistical analysis in making the determination. Based on this rigorous evaluation, IBM can make the NITT decision with certainty and with an unchallengeable degree of confidence.

IBM takes notification of NITT very seriously. A notice is sent only when the conclusion is unmistakable.

By sending a violation notice, IBM has determined, undoubtedly, the testing candidate had access to the test questions (and used the questions) to prepare for the exam. The status of Pass or Fail does not matter. Also, it is not relevant whether the use of questions, from the certification exam, was intentional (or unintentional). In all cases, the fact is the test-taker had reviewed the questions from the certification exam, prior to taking the test.

This video highlights some of the reasons you might want to become certified.

When others learn that I’m involved in writing certification test questions, the typical response is to jokingly ask for copies of the questions. While I get where the humor is coming from, it really isn’t funny to me, because I understand the value of an honestly earned certification. There are ramifications for asking for or distributing questions and answers, and they exist for good reason.

Calculating Hypervisor Memory Overhead

Edit: Some links no longer work.

Originally posted December 15, 2015 on AIXchange

A customer recently contacted IBM Support, wondering how much memory the hypervisor could be expected to consume in their real world environment. Even given how inexpensive memory has become and how convenient it is to add and modify partitions as needed, customers can benefit by planning for their expected workloads as well as their hypervisor and VIO server memory overhead.

Of course this is hardly a new topic. About a decade ago, the LPAR validation tool helped customers obtain this sort of information. This interesting article from 2004 mentions hypervisor memory overhead:

Aside from the memory configured for a partition, additional memory is used for the Hypervisor, translation control entries (TCE) memory and page tables. The Hypervisor is firmware on the pSeries LPAR-capable systems, which helps ensure that partitions only use their authorized resources. When a pSeries system is running with partitions, 256 MB of memory is used by the Hypervisor for the entire system. There’s additional overhead memory called TCE, which is used for direct memory access (DMA) for I/O devices. For every four I/O drawers on a pSeries system, 256 MB of memory is allocated for TCE memory. The TCE memory isn’t additional overhead specific to partitions. Even AIX systems without partitions use TCE memory, but it’s included in the AIX system memory. Page tables are used to map physical memory pages to virtual memory pages. Like the TCE tables, page tables aren’t a unique overhead for LPARs. In other AIX non-partitioned systems, this overhead memory is part of the memory that AIX allocates at boot. Each partition needs 1/64th of its memory size, rounded to a power of two, for page table space in memory. The amount of page table space that’s allocated is based on the maximum memory setting in the partition’s profile.

Now here’s a recent article on hypervisor page table entries:

When setting memory values, it’s important to remember that the size of the Hypervisor page table (HPT) entries that are used to keep track of the real memory to virtual memory mappings for the LPAR is calculated based on maximum, not desired, memory. This means that common sense needs to be applied to setting maximum memory for an LPAR or Hypervisor memory overhead will be much higher than necessary.
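
As a rough sanity check, you can apply the 1/64 rule quoted above to an LPAR's maximum memory setting. The exact firmware algorithm is proprietary (as the support response below notes), so treat this purely as a ballpark estimate:

    # estimate HPT size from the LPAR's maximum (not desired) memory; values are examples
    max_mem_mb=65536                              # 64 GB maximum memory
    hpt_mb=$(( max_mem_mb / 64 ))                 # 1/64th of maximum memory
    p=1; while [ $p -lt $hpt_mb ]; do p=$(( p * 2 )); done
    echo "Estimated HPT size: ${p} MB"            # prints 1024 MB for a 64 GB maximum

Dropping the maximum memory setting to 32 GB would cut the estimate to 512 MB, which is why the article stresses applying common sense to maximum memory values.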

This is the response my customer received from IBM:

The exact algorithm to calculate the amount of memory reserved for PHYP is proprietary information and I cannot send that to you. The official method for calculating that is the “IBM System Planning Tool.” You should adjust the system plan with your information.

Let’s try to do that calculation for your current configuration first:

1. On the HMC select the server in question and from Tasks do Configuration -> System Plans -> Create
2. Once created from the left pane of your HMC select System Plans.
3. Select the created system plan and from tasks -> Export System Plan
4. On the new window, select the “Export to this computer from the HMC” radio button and click OK
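
If you just want a quick look at what the hypervisor is currently holding, the HMC command line can also report it. A small sketch (I believe these attribute names are valid on recent HMC levels, but verify against lshwres on yours):

    # run from the HMC command line; <managed-system> is a placeholder
    lshwres -r mem -m <managed-system> --level sys \
        -F configurable_sys_mem,curr_avail_sys_mem,sys_firmware_mem
    # sys_firmware_mem is the memory (in MB) currently reserved by the hypervisor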

Are you doing this calculation to get an idea of hypervisor overhead on your systems, or do you simply make sure the system has plenty of available memory and keep in mind that some percentage will be going to system overhead?

The Costs of Technical Debt

Edit: Still an important concept to understand.

Originally posted December 8, 2015 on AIXchange

As often as I see it, it still surprises me when I encounter a company that depends on some application, but chooses to run it on unsupported hardware without maintenance agreements and/or vendor support. If anything goes sideways, who knows how they will stay in business.

Another situation that isn’t uncommon involves time-sensitive projects, new builds where settings or changes are identified and added to a change log. It’s supposed to get taken care of in a few days, but you know the drill. Somehow the changes aren’t made, and before you know it the machine is now production. The build process is over and users are on to testing or development.

Then there are the innumerable enterprises that continue to run old hardware, old software, old operating systems or old firmware. Why is this the case? Are business owners not funding needed updates and changes? Is it a vendor issue? Sometimes vendors go out of business or discontinue support of back versions of their solutions. In smaller shops, maybe one tech cares for the system, and no one else has any idea what’s being done to keep things running. This becomes a problem if that one tech leaves. Then there’s the all-purpose excuse: “If it isn’t broke, why fix it?”

There’s actually a name for this: technical debt:

Technical debt (also known as design debt or code debt) is a recent metaphor referring to the eventual consequences of any system design, software architecture or software development within a codebase. The debt can be thought of as work that needs to be done before a particular job can be considered complete or proper. If the debt is not repaid, then it will keep on accumulating interest, making it hard to implement changes later on. Unaddressed technical debt increases software entropy.

Analogous to monetary debt, technical debt is not necessarily a bad thing, and sometimes technical debt is required to move projects forward.

As a change is started on a codebase, there is often the need to make other coordinated changes at the same time in other parts of the codebase or documentation. The other required, but uncompleted changes, are considered debt that must be paid at some point in the future. Just like financial debt, these uncompleted changes incur interest on top of interest, making it cumbersome to build a project. Although the term is used in software development primarily, it can also be applied to other professions.

It’s hardly a new term, either. Although this piece, from 2003, focuses on the process of writing software, I think it’s applicable to other areas of IT as well.

Technical Debt is a wonderful metaphor developed by Ward Cunningham to help us think about this problem. In this metaphor, doing things the quick and dirty way sets us up with a technical debt, which is similar to a financial debt. Like a financial debt, the technical debt incurs interest payments, which come in the form of the extra effort that we have to do in future development because of the quick and dirty design choice. We can choose to continue paying the interest, or we can pay down the principal by refactoring the quick and dirty design into the better design. Although it costs to pay down the principal, we gain by reduced interest payments in the future.

The metaphor also explains why it may be sensible to do the quick and dirty approach. Just as a business incurs some debt to take advantage of a market opportunity developers may incur technical debt to hit an important deadline. The all too common problem is that development organizations let their debt get out of control and spend most of their future development effort paying crippling interest payments.

The tricky thing about technical debt, of course, is that unlike money it’s impossible to measure effectively.

The same article cites this 1992 report. (Funny how as quickly as business computers evolve, some of the underlying issues of using them remain with us.)

Shipping first time code is like going into debt. A little debt speeds development so long as it is paid back promptly with a rewrite…. The danger occurs when the debt is not repaid. Every minute spent on not-quite-right code counts as interest on that debt. Entire engineering organizations can be brought to a stand-still under the debt load of an unconsolidated implementation, object-oriented or otherwise.

Here’s more from the wikipedia link:

“It is useful to differentiate between kinds of technical debt. Fowler differentiates “Reckless” vs. “Prudent” and “Deliberate” vs. “Inadvertent” in his discussion on Technical Debt quadrant.”

There’s also this:

The concept of technical debt is central to understanding the forces that weigh upon systems, for it often explains where, how, and why a system is stressed. In cities, repairs on infrastructure are often delayed and incremental changes are made rather than bold ones. So it is again in software-intensive systems. Users suffer the consequences of capricious complexity, delayed improvements, and insufficient incremental change; the developers who evolve such systems suffer the slings and arrows of never being able to write quality code because they are always trying to catch up.

Finally, this article argues that we aren’t making the leaps and bounds in computing we once did, in part because of technical debt.

A decade ago virtual reality pioneer Jaron Lanier noted the complexity of software seems to outpace improvements in hardware, giving us the sense that we’re running in place. Our computers, he argued, have become more complex and less reliable. We can see the truth of this everywhere: Networked systems provide massive capacities but introduce great vulnerabilities. Simple programs bloat with endless features. Things get worse, not better.

Anyone who’s built a career in IT understands this technical debt. Legacy systems persist for decades. Every major operating system — desktop and mobile — has bugs so persistent they seem more like permanent features than temporary mistakes. Yet we constantly build new things on top of these increasingly rickety scaffolds. We do more, so we crash more — our response to that has been to make crashes as nearly painless as possible. The hard lockups and BSODs of a few years ago have morphed into a momentary disappearance, as if nothing of real consequence has happened.

Worse still, we seem to regard every aspect of IT with a ridiculous and undeserved sense of permanence. We don’t want to throw away our old computers while they still work. We don’t want to abandon our old programs. Some of that is pure sentimentality — after all, why keep using something that’s slow and increasingly less useful? More of it reflects the investment of time and attention spent learning a sophisticated piece of software.

What are your thoughts? Is “good enough” actually good enough, or could we be doing more?

Moving an AIX System

Edit: Some links no longer work.

Originally posted December 1, 2015 on AIXchange

If you’re tasked with migrating, duplicating or cloning your system from old to new hardware, how do you go about it?

If the system isn’t too old, and your source systems are virtualized, you may be able to perform a live partition mobility operation. That process is non-intrusive enough that your users may not even realize there’s been a migration (although hopefully they’ll notice that things are running much faster on the new hardware).

Assuming the source system isn’t virtualized or is using internal disks, perhaps a fibre card is available. That way you could allocate some new LUNs to the source system, copy your rootvg data to them, and then swing the LUNs over to your destination. This method works — provided you do it correctly.

This technote covers some things to look for in IBM’s “Supported Methods of Duplicating an AIX System”:

Question: I would like to move, duplicate or clone an AIX system onto another partition or hardware. How can I accomplish this?

Answer: This document describes the supported methods of duplicating, or cloning, an AIX instance to create new systems based on an existing one. It also describes methods known to us that are not supported and will not work.

Q: Why Duplicate A System?

A: Duplicating an installed and configured AIX system has some advantages over installing AIX from scratch, and can be a faster way to get a new LPAR or system up and running.

Using this method customized configuration files, installation of additional AIX filesets, application configurations and tuning parameters can be set up once and then installed on another system or partition.

Supported Methods
1. Cloning a system via mksysb backup from one system and restore to new system.
2. Using the alt_disk_copy command.
3. Using alt_disk_mksysb to install a mksysb image on another disk.

Advanced Techniques
1. Live Partition Mobility
2. Higher Availability Using SAN Services

There are methods not described here, which have been documented by DeveloperWorks. Please refer to the document “AIX higher availability using SAN services” for details.

Non-Preferred Methods
There are other methods that may not produce a bootable system under some scenarios. When used in a virtual environment or according to the IBM DeveloperWorks document mentioned above, they may be used to replicate or move a rootvg. However, if used with directly attached disks (either internal or SAN-based) they may not work.

Some of these methods are:
1. Using a bitwise copy of a rootvg disk to another disk.
2. Removing the rootvg disks from one system and inserting into another.

This also applies to re-zoning SAN disks that contain the rootvg so another host can see them and attempt to boot from them.

Q: Why don’t these methods work?

A: The reason for this is there are many objects in an AIX system that are unique to it; Hardware location codes, World-Wide Port Names, partition identifiers, and Vital Product Data (VPD) to name a few. Most of these objects or identifiers are stored in the ODM and used by AIX commands.

If a disk containing the AIX rootvg in one system is copied bit-for-bit (or removed), then inserted in another system, the firmware in the second system will describe an entirely different device tree than the AIX ODM expects to find, because it is operating on different hardware. Devices that were previously seen will show missing or removed, and the system may fail to boot with LED 554 (unknown boot disk).
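
If you stick with the supported methods above, the commands involved are short. A minimal sketch (disk names and image paths are examples; adjust for your environment):

    # 1. mksysb: back up rootvg on the source, then restore it on the target via NIM or alt_disk_mksysb
    mksysb -i /export/images/source.mksysb

    # 2. alt_disk_copy: clone the running rootvg to a spare disk or LUN, then move that disk to the target
    alt_disk_copy -d hdisk1

    # 3. alt_disk_mksysb: install an existing mksysb image onto another disk
    alt_disk_mksysb -m /export/images/source.mksysb -d hdisk1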

Feel free to share your own migration practices in comments.

Getting Started with Spectrum Scale

Edit: Some links no longer work.

Originally posted November 24, 2015 on AIXchange

IBM recently published — and just updated — a Redbook that covers IBM Spectrum Scale (formerly GPFS).

This IBM Redbooks publication updates and complements the previous publication: Implementing the IBM General Parallel File System in a Cross Platform Environment, SG24-7844, with additional updates since the previous publication version was released with IBM General Parallel File System (GPFS). Since then, two releases have been made available up to the latest version of IBM Spectrum Scale 4.1. Topics such as what is new in Spectrum Scale, Spectrum Scale licensing updates (Express/Standard/Advanced), Spectrum Scale infrastructure support/updates, storage support (IBM and OEM), operating system and platform support, Spectrum Scale global sharing – Active File Management (AFM), and considerations for the integration of Spectrum Scale in IBM Tivoli Storage Manager (Spectrum Protect) backup solutions are discussed in this new IBM Redbooks publication.

This publication provides additional topics such as planning, usability, best practices, monitoring, problem determination, and so on. The main concept for this publication is to bring you up to date with the latest features and capabilities of IBM Spectrum Scale as the solution has become a key component of the reference architecture for clouds, analytics, mobile, social media, and much more.

If you’re looking for a shorter time investment, check out this introductory video. It provides an overview of Spectrum Scale and its benefits. It runs about 6 minutes. There’s also a 2-part video series that goes into a little more detail. These vids run about 20 minutes each.

Part one covers the concepts and technology:

This is a technical introduction to Spectrum Scale FPO for Hadoop designed for those who are already familiar with HDFS concepts. Key concepts such as GPFS NSDs, Storage Pools, Metadata, and Failure Groups are covered.

Part two shows you how to set up a simple GPFS cluster:

This is a technical introduction to Spectrum Scale FPO for Hadoop designed for those who are already familiar with HDFS concepts. In this video, I show how the concepts from Part 1 can be applied with a demo of setting up a 2-node cluster from scratch.
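
For context, the part two demo boils down to a handful of mm* commands. Here's a rough sketch of a minimal two-node cluster (node names, the disk stanza file and the file system name are placeholders; check the exact options against your Spectrum Scale level):

    # create the cluster and accept server licenses (node1/node2 are examples)
    mmcrcluster -N node1:quorum-manager,node2:quorum-manager -r /usr/bin/ssh -R /usr/bin/scp -C demo_cluster
    mmchlicense server --accept -N node1,node2

    # define NSDs from a stanza file describing the shared disks, start GPFS, then create and mount a file system
    mmcrnsd -F /tmp/nsd.stanza
    mmstartup -a
    mmcrfs gpfs0 -F /tmp/nsd.stanza -A yes
    mmmount gpfs0 -a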

I’ve actually been looking for an opening to write about this topic, because I’m seeing more customers running Spectrum Scale. If you’ve used it, please share your experiences in comments.

Replacing Disks with replacepv

Edit: Some links no longer work.

Originally posted November 17, 2015 on AIXchange

IBM developerWorks recently posted this piece about replacing a boot disk in PowerVC.

The developerWorks article mentions the replacepv command, using an example where this was run:

    replacepv hdisk0 hdisk1

I haven’t messed around with replacepv, but once I read about its capabilities, I was impressed:

The replacepv command replaces allocated physical partitions and the data they contain from the SourcePhysicalVolume to DestinationPhysicalVolume. The specified source physical volume cannot be the same as DestinationPhysicalVolume.

Note:
    The DestinationPhysicalVolume must not belong to a volume group.
    The DestinationPhysicalVolume size must be at least the size of the SourcePhysicalVolume.
    The replacepv command cannot replace a SourcePhysicalVolume with stale logical volume unless this logical volume has a non-stale mirror.
    You cannot use the replacepv command on a snapshot volume group or a volume group that has a snapshot volume group.
    Running this command on a physical volume that has an active firmware assisted dump logical volume temporarily changes the dump device to /dev/sysdumpnull. After the migration of logical volume is successful, this command calls the sysdumpdev -P command to set the firmware assisted dump logical volume to the original logical volume.
   The VG corresponding to the SourcePhysicalVolume is examined to determine if a PV type restriction exists. If a restriction exists, the DestinationPhysicalVolume is examined to ensure that it meets the restriction. If it does not meet the PV type restriction, the command will fail.

The allocation of the new physical partitions follows the policies defined for the logical volumes that contain the physical partitions being replaced.

-f Forces to replace a SourcePhysicalVolume with the specified DestinationPhysicalVolume unless the DestinationPhysicalVolume is part of another volume group in the Device Configuration Database or a volume group that is active.
-R dir_name Recovers replacepv if it is interrupted by <ctrl-c>, a system crash, or a loss of quorum. When using the -R flag, you must specify the directory name given during the initial run of replacepv. This flag also allows you to change the DestinationPhysicalVolume.

Some of you may be wondering what took me so long to get on board with replacepv. This functionality has been around awhile now (see here and here). Maybe I heard about it, and forgot. I have done the same type of thing using migratepv or running mirrorvg (though the latter requires the extra step of breaking the mirror by removing logical volumes from the disk I wanted to remove).
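
For comparison, the manual approach I was used to looks something like this (disk names are examples; the bosboot and bootlist steps only apply when rootvg is involved):

    extendvg rootvg hdisk1        # add the new disk to the volume group
    migratepv hdisk0 hdisk1       # move all physical partitions off the old disk
    reducevg rootvg hdisk0        # remove the old disk from the volume group
    bosboot -ad /dev/hdisk1       # recreate the boot image on the new disk
    bootlist -m normal hdisk1     # point the boot list at the new disk

With replacepv, that whole sequence collapses into a single command.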

Going forward though, I’ll be sure to add this to my bag of tricks. I would encourage anyone else who hasn’t used replacepv to do the same.

A Different View of Virtualization

Edit: Still worth considering, and 96G is still pretty small.

Originally posted November 10, 2015 on AIXchange

This article examines the issues VMware and x86 customers face as they try to virtualize their environments:

Server virtualization has brought cost savings in the form of a reduced footprint and higher physical server efficiency along with the reduction of power consumption.

Obviously, we in the Power systems world can take this statement to heart. By reducing our physical server count and consolidating workloads, we can save on power and cooling and all of the other physical things we need for our systems (including network ports, SAN ports, cables, etc.).

A non-technical driver may be the workload’s size. If an application requires the equivalent amount of compute resources as your largest VM host, it would be cost prohibitive to virtualize the application. For instance, a large database server consumes 96 GB of RAM, and your largest physical VM host has 96 GB of RAM. The advantages of virtualization may not outweigh the cost of adding a hypervisor to the overhead of the workload.

One last non-technical barrier is political issues surrounding mission-critical apps. Even in today’s climate, there’s a perception by some that mission-critical applications require bare-metal hardware deployments.

I found this interesting since 96 GB of memory isn’t a lot on today’s Power servers. In addition, with the scaling in both memory and CPU, we can assign some very large workloads to our servers. Though the need to assign physical adapters exclusively to an LPAR is far less than it once was, we still have the option to use the VIO server for some workloads and physical adapters for others. Alternatively, we can use virtual for network and physical for SAN, or vice versa. With this flexibility, we can mix and match things as needed and make changes dynamically. It’s another advantage to running workloads on Power:

It would be unrealistic to think the abstraction that enables the benefits of virtualization doesn’t come at a cost. The hypervisor adds a layer of latency to each CPU and I/O transaction. The more intense the application performance requires, the more impact to the latency.

Since Power Systems are always virtualized, the hypervisor is always running on the system. The chips and the hypervisor are designed for virtualization. The same company designs the hardware, virtualization layer and the operating system. Everything works hand in hand. Even a single LPAR running on a Power frame runs the same hypervisor under the covers. We simply don’t see the kinds of performance penalties that VMware users do:

However, these direct access optimizations come at a cost. Enabling DirectPath I/O for Networking for a virtual machine disables advanced vSphere features such as vMotion. VMware is working on technologies that will enable direct hardware access without sacrificing features.

The same argument around Live Partition Mobility (LPM) could be made for Power systems that have been built with dedicated adapters. The nice thing is that on the fly we can change from physical adapters to virtualized adapters, run an LPM operation to move our workload to another physical frame, and then add physical adapters back into the LPAR. The flexibility we get with dynamic logical partitioning (DLPAR) operations allows us to add and remove memory, CPU, and physical and virtual adapters from our running machine.

As a quick aside, I expect to see even more blurring of the ways we virtualize our adapters as we continue to adopt SR-IOV:

SR-IOV allows multiple logical partitions (LPARs) to share a PCIe adapter with little or no run time involvement of a hypervisor or other virtualization intermediary. SR-IOV does not replace the existing virtualization capabilities that are offered as part of the IBM PowerVM offerings. Rather, SR-IOV complements them with additional capabilities.

Getting back to the article on VMware and x86 customers, I was surprised by the conclusion. Most of my Power customers are able to virtualize a very high percentage of their workloads:

Complex workloads can challenge the desire to reach 100% virtualization within a data center. While VMware has closed the gap for the most demanding workloads, it may still prove impractical to virtualize some workloads.

Have you found the overhead associated with hypervisors a hindrance to virtualizing your most demanding workloads?

I’d like to pose these questions to you, my readers. How much of your workloads are virtualized? Do you even consider hypervisors or overhead when you think about deploying your workloads on Power?

A List of System Scanning Tools

Edit: Some links no longer work.

Originally posted November 3, 2015 on AIXchange

What kinds of tools do you use to document and check your systems? I’ve written about prtconf, a built-in tool, and hmcscanner, but many other solutions are available.

Here are three software tools that readers have shared with me. I’m not endorsing any of them; my hope is that by listing a few solutions in one place, it will help you conveniently research new options for your own environments.

systemscanaix:

SystemScan AIX can help by identifying problems, mistakes, and omissions made during the build phase, helping you to improve the security, performance, and serviceability of your systems.

(It) consists of a single RPM that can be installed on AIX 5.3, 6.1, or 7.1. It also has separate modules for HMC/IVM, and VIOS, that can be run from cron and silently produce system configuration reports that can then be transferred to another server for analysis.

For details, see the sample report and FAQs.

aixhealthcheck:

AIX Health Check is software that scans your AIX system for issues. It’s like an automated AIX check list. Download it from our website, unpack and run it on your AIX server and receive a full report in minutes. You decide the format: Text, HTML, CSV or XML output. Have the report emailed to you if you like. AIX Health Check is designed to help you pro-actively detect configuration abnormalities or other issues that may keep your AIX system from performing optimally.

See the sample reports and FAQs for more.

cfg2html, a free tool:

Cfg2html is a UNIX shell script similar to supportconfig, getsysinfo or get_config, except that it creates a HTML (and plain ASCII) system documentation for HP-UX 10.xx/11.xx, Integrity Virtual Machine, SCO-UNIX, AIX, Sun OS and Linux systems. Plug-ins for SAP, Oracle, Informix, Serviceguard, Fiber Channel/SAN, TIP/ix, OpenText (IXOS/LEA), SAN Mass Storage like MAS, EMC, EVA, XPs, Network Node Manager and HP DataProtector etc. are included. The first versions of cfg2html were written for HP-UX. Meanwhile the cfg2html HP-UX stream was ported to all major *NIX platforms, LINUX and small embedded systems.

Some consider it to be the Swiss army knife for the Account Support Engineer, Customer Engineer, System Admin, Solution Architect etc. Originally developed to plan a system update, it was also found useful to perform basic troubleshooting or performance analysis. The production of nice HTML and plain ASCII documentation is part of its utility.

Go here for additional information.

Feel free to use the comments to mention other tools and options.

HMC installios Cleanup

Edit: Some links no longer work. Some updates at the bottom.

Originally posted October 27, 2015 on AIXchange

Awhile back, I was called in to assist an IBM i heritage customer that encountered difficulty installing a VIO server from their HMC.

Fortunately, this support document had some helpful information:

This document describes how to clean up HMC installios after a failure or interruption of the command.

HMC installios process failed or was interrupted before completing, and subsequent installios command fails with a permission error, such as “/tmp/installios.lock : print Operation not permitted.”

1. If a problem occurred during the installation and installios did not automatically unconfigure itself, run the following command to manually unconfigure the installios installation resources.

    installios -u

Sometimes the command may fail with a “Permission Denied” error or an error similar to the one below. If it does, proceed with the remaining procedure.

    hscroot@hostname:~> installios -u
    nimol_config MESSAGE: Unconfiguring the NIMOL server…
    nimol_config ERROR: The file /etc/nimol.conf does not exist.
    nimol_config MESSAGE: Running undo…
    ERROR unconfiguring nimol_config.

2. Check if any of the following exist. If so, they need to be removed:

    /tmp/installing.lock
    /tmp/installios_cfg.lock
    /tmp/installios_pid

To remove the file(s), you must obtain a “temporary” PESH access code to gain root access by contacting an HMC Software Support Representative at 1-800-IBM SERV. You will need the HMC serial number. …

Once you have root access to the HMC, change the file(s) permissions by running:

    chmod 775 /tmp/<filename>

At this point, you can try ‘installios -u’ again or manually remove the file(s). Then try the installation again.

HMC 7.3.4 has a known issue with lpar_netboot command creating log files in /tmp such that later execution will cause a log file collision resulting in a failure due to permission error. The fix is in HMC 7.3.5 with (mandatory fix) PTF MH01197. For more details, please, contact an HMC Software Support Representative.

In our case, cleaning up from the installation was as simple as running installios -u and then retrying the operation. Sure enough, on the retry, it again hung partway through the install. I guessed that this was the point where the previous attempt had been aborted.

On the HMC I was able to look at the log file:

    /var/log/nimol.log

I found that the install got this far:

    2015-08-13T06:14:35.088694-05:00 ioserver nimol: ,info=initialization
    2015-08-13T06:14:36.037522-05:00 ioserver nimol: ,info=verifying_data_files
    2015-08-13T06:14:41.084288-05:00 ioserver nimol: ,info=prompting_for_data_at_console
The LPAR was hung at LED 0c48.

I was able to open a console to the LPAR and then select the LUN that the VIO server would be installed to. In this case the LUN was being reused, and the installer recognized that a rootvg was already there. Rather than simply auto-overwrite the LUN, we received a warning prompt. It was making sure we actually wanted to overwrite it. I found this behavior pretty slick.

In general, I prefer NIM for installing VIOS, but in this case the alternative was the best choice, given the overall expertise of the people doing the installation. For an IBM i team with no knowledge of AIX or the NIM server, NIM would have been too much trouble.

——-

EDIT: This was where the original post ended. I got an email from an old co-worker from my days at IBM, Vic Walter. He gave me permission to share our conversation.

Hey Rob,

                I hope all is well with you. I am having issues with VIO installs via HMC image failing and ran across your article.

                When I run the cmd…

                                installios -F -e -R default1

                I get an error message….

                                ERROR removing default1 label in nimol_config.

                And am not finding anything about where nimol_config is

                Maybe you can help ?  thx

——-

I replied with:

Were you able to get the PESH password from IBM?  Seems like they would be able to help?  I guess I would run a find command to see if I could find the file..

——-

He replied with:

I do have a case open with IBM and did get the pesh passwords, even running the installios cmd as root also fails.

Find as root did find it.  /usr/sbin/nimol_config is a script, but has no default1 reference in it.

[hmc1 /] # grep default /usr/sbin/nimol_config

                               \rdefaults:

                               \r\t-L    default

                msg “No NIMOL server hostname specified, using %s as the default.\n” “$NIMOL_SERVER”

# Specify the defaults if variables aren’t set.

[[ -z ${LABEL} ]] && LABEL=”default”

——-

After some back and forth, he sent me an update from IBM

——-

Hi Rob,

                Sorry for the delay in responding. IBM’s solution was to shell into the HMC as hscpe (with the pw they provided), su - root, and run these cmds.

Once you login as root, first perform cleanup of the previous installios attempts with the below commands:
installios -F -e -R default1
installios -u
check for the below lock files and remove them if exist:
ls -la /tmp/installing.lock
ls -la /tmp/installios_cfg.lock
ls -la /tmp/installios_pid

                In my case the 3 files were present and I removed them. After that the HMC was rebooted by one of the AIX admins before I could get back to the VIO install.

When I did get back to the VIO installs all went well.

One other issue is the NIC used in the failing VIO install was not able to network boot off of the HMC for some reason. I borrowed the NIC from the other VIO to complete the install and this is when the failures appeared. This failure of the network boot could have been the original cause of the VIO install fail and incomplete cleanup. I am not real sure here. This is a new frame, but it also was not seeing one of the internal NVMe disks. One slot had “unknown” instead of the usual 800 GB NVMe description. I had the IBM CE reseat things and run diag on the box. He did find the drive not seated properly and otherwise found no issues.

——-

The main reason I wanted to document this was so that in the future, if this post comes up in your search, there will be another option for you to try.

An Underutilized PowerHA Option

Edit: Some links no longer work.

Originally posted October 20, 2015 on AIXchange

Awhile back, IBM’s Chris Gibson offered a PowerHA tip that you might have missed:

You can use the SEA poll_uplink method (requires VIOS 2.2.3.4). In this case the SEA can pass up the link status, so no “!REQD” style ping is required any more.

Yes, you can install VIOS 2.2.3.50 on top of 2.2.3.4.

At the moment I’m not aware of any official documentation regarding how to configure SEA poll_uplink in a PowerHA environment. I was in touch with Dino Quintero (editor of the PowerHA Redbooks) and his team will update the latest PowerHA Redbook with this information soon.

However, it’s very easy to enable SEA poll_uplink in PowerHA. Configuration steps:

* Enable poll_uplink on ent0 interface (run this command for all virtual interfaces on all nodes):
    # chdev -l ent0 -a poll_uplink=yes -P
* This change requires a reboot.
* Check ent0 and the uplink status:
    # lsattr -El ent0 | grep poll_uplink
    poll_uplink yes Enable Uplink Polling True
    poll_uplink_int 1000 Time interval for Uplink Polling True
    # entstat -d ent0 | grep -i bridge
    Bridge Status: Up

* Enable poll uplink in CAA / PowerHA:
     # clmgr -f modify cluster MONITOR_INTERFACES=enable
* Run cluster verification and synchronization.
* Finally, start PowerHA cluster.

In response, another IBMer, Shawn Bodily, tweeted that he’d updated the PowerHA wiki with this information.

That prompted Chris to post this additional information:

I wanted to mention a new AIX feature, available with AIX 7.1 TL3 (and 6.1 TL9) called the “AIX Virtual Ethernet Link Status” capability. Previous implementations of Virtual Ethernet do not have the ability to detect loss of network connectivity.

For example, if the VIOS SEA is unavailable and VIO clients are unable to communicate with external systems on the network, the Virtual Ethernet adapter would always remain “connected” to the network via the Hypervisors virtual switch. However, in reality, the VIO client was cut off from the external network.

This could lead to a few undesirable problems, such as, a) needing to provide an IP address to ping for Etherchannel (or NIB) configurations to force a failover during a network incident, lacking the ability to auto fail-back afterwards, b) unable to determine total device failure in the VIOS and c) PowerHA fail-over capability was somewhat reduced as it was unable to monitor the external network “reach-ability.”

The AIX VEA Link Status feature provides a way to overcome the previous limitations. The new VEA device will periodically poll the VIOS/SEA using L2 packets (LLDP format). The VIOS will respond with its physical device link status. If the VIOS is down, the VIO client times out and sets the uplink status to down.

To enable this new feature you’ll need your VIO clients to run either AIX 7.1 TL3 or AIX 6.1 TL9. Your VIOS will need to be running v2.2.3.0 at a minimum (recommend 2.2.3.1). There’s no special configuration required on the VIOS/SEA to support this feature. On the VIO client, you’ll find two new device attributes that you can configure/tune. These attributes are:

    poll_uplink (yes, no)
    poll_uplink_int (100ms – 5000ms)

Here’s some output from the lsattr and chdev commands on my test AIX 7.1 TL3 partition that show these new attributes.

   # oslevel -s
   7100-03-01-1341
   # lsattr -El ent0 | grep poll
   poll_uplink     no             Enable Uplink Polling                    True
   poll_uplink_int 1000          Time interval for Uplink Polling           True
   # lsattr -El ent0 -a poll_uplink
   poll_uplink no Enable Uplink Polling True
   # lsattr -Rl ent0 -a poll_uplink
   no
   yes
   # lsattr -Rl ent0 -a poll_uplink_int
   100…5000 (+10)

Although Chris first mentioned this in March and brought it up again this summer, I’m not sure many of you are aware of this option. Even some PowerHA guys I reached out to didn’t know about it, so this information seems well worth sharing.

The Simplest Script

Edit: Send me your scripts.

Originally posted October 13, 2015 on AIXchange

I was recently working with someone who had built some new LPARs. As part of the build out he decided his NIM server would make a good general purpose utility server. This NIM server would become a one-stop shop where he planned to stage fixes along with the base OS images he’d use to create his environment.

During the build out, he needed to get console access to servers so he could, for example, configure networking. That meant logging into the HMC and then running vtmenu. However, this extra step of logging into the HMC was taking too long.

He set it up so that he could ssh with keys to all of the LPARs in the environment, including the VIO servers and the HMCs from the NIM server. This became his central point of control. He could get anywhere by just logging into the NIM server first. (Obviously it then becomes critical to lock down NIM server access to prevent individual users from freely roaming this environment, but this can be accomplished easily enough.)
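
Setting up the HMC side of that is quick. A sketch of the usual recipe (hostnames are placeholders; mkauthkeys is the HMC command that registers a public key for key-based logins):

    # on the NIM/utility server, generate a key pair if you don't already have one
    ssh-keygen -t rsa
    # push the public key to the HMC for the hscroot user
    KEY=$(cat ~/.ssh/id_rsa.pub)
    ssh hscroot@<hmc-ip-address> "mkauthkeys -a \"$KEY\""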

These articles (here, here and here) note that vtmenu works fine for getting a console, and it’s actually my preferred method of gaining console access. But why go to the hassle of logging into your HMC if you can just do it from your utility server?

Always interested in saving extra steps, my colleague went ahead and set up a simple script on his utility LPAR. Let me emphasize the word “simple” — this script is just a single line in /usr/local/bin:

    ssh -t hscroot@<hmc-ip-address> "vtmenu"

This works because he can log into the HMC without a password using his ssh keys. It brings him directly to the list of managed servers that you’d expect. From there, he can pick the frame and LPAR he wants to see. (Note: Of course, <hmc-ip-address> would need to be replaced with your actual HMC IP address for use in your environment.)

One way this could be further automated is to look through the vtmenu script on the HMC (/usr/hmcrbin/vtmenu) and borrow the commands it runs. For example:

    lssyscfg -r sys -F name

    lssyscfg -m <machine name> -r lpar -F name

With that information, your own script could open a console the same way vtmenu does:

    mkvterm -m <machine name> -p <partition_name>
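
For illustration only, here's a rough, untested sketch of how those pieces could be strung together; it is not the script my colleague uses, and it assumes hscroot access with ssh keys already in place:

    #!/bin/ksh
    # pick a managed system and partition, then open a console via the HMC
    HMC=hscroot@<hmc-ip-address>

    echo "Managed systems:"
    ssh $HMC "lssyscfg -r sys -F name"
    read ms?"Managed system: "

    echo "Partitions on $ms:"
    ssh $HMC "lssyscfg -r lpar -m $ms -F name"
    read lp?"Partition: "

    # -t gives us a proper terminal for the console session
    ssh -t $HMC "mkvterm -m $ms -p $lp"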

While further modifications weren’t needed in this case, I’d still like to see something that behaves this way. So if you’re willing to share your own time-saving scripts, I’d love to take a look. You may not consider your scripts to be suited for anything other than what you’re doing, but that’s not necessarily the case. We can all learn from one another.

IBM Announcements Including AIX 7.2 and New Linux Servers

Edit: Some links no longer work.

Originally posted October 5, 2015 on AIXchange

What version of AIX are you running? At conferences and events where presenters ask attendees to put up their hands, I am seeing fewer and fewer shops running AIX 5.3, AIX 5.2 or older versions of AIX. They are migrating to newer hardware and newer versions of PowerVM and the AIX operating system.

I have been to some training lately that reiterated a good point: when you run AIX on IBM Power servers, you get your products and support from one vendor. Instead of worrying about compatibility among your server, your hypervisor and your operating system, or about overhead from your hypervisor introducing performance penalties, you run an integrated stack from the hardware through the firmware into the operating system. You take advantage of IBM’s mainframe heritage and virtualization options that are designed into the hardware instead of bolted on in software. The massive memory bandwidth and threads per core that are available with POWER8, and the latest operating system versions that exploit the hardware, are unmatched by the competition.

In that training I also heard some differentiators for AIX and Power compared to other operating systems and environments. AIX usually runs key workloads, while other operating systems are used for less-critical applications. AIX and POWER8 offer better performance and better scaling than other platforms. And with PowerVM, organizations have many opportunities for server and workload consolidation, with the ability to tightly “pack” these servers and run at high average utilizations.

Until now, our options for running AIX on POWER8 were AIX 5.2 and AIX 5.3 in versioned WPARs, AIX 6.1 and AIX 7.1.

IBM’s latest announcement brings us to AIX version 7.2, which will provide Live Update for interim fixes, Server Based Flash Caching, 40G RoCE for Oracle RAC performance, and vNIC adapters that we can use with SR-IOV adapters to provide quality of service (QoS) settings. vNIC will also help us use SR-IOV adapters with Live Partition Mobility, which was one of the drawbacks of SR-IOV before. vNIC will be more efficient than using shared Ethernet adapters (SEA) in our VIO servers, and it will also work with AIX 7.1 TL4, so you do not necessarily need to upgrade to AIX 7.2 to take advantage of it.

AIX 7.2 still comes in two varieties: AIX Standard Edition (which includes Dynamic System Optimizer) and AIX Enterprise Edition. Dynamic System Optimizer will be included in the base OS to help us with system tuning, especially on the larger multi-CEC systems. AIX Enterprise Edition consists of everything that you find in AIX Standard Edition, but it also includes other IBM software you can use to manage your environment, including:

  • PowerVC
  • PowerSC Standard Edition
  • AIX Dynamic System Optimizer
  • Cloud Manager with OpenStack for Power V4.3
  • IBM Tivoli Monitoring V8.1
  • IBM BigFix Lifecycle V9.2

One interesting feature that is coming along is the ability to live update service packs and technology levels. The current AIX hotpatch technology (available since AIX 6.1) is great for certain isolated ifixes, but is not extensible to service packs or technology levels. AIX 7.2 Live Update is a new approach that initially supports only ifixes, but is designed to be extensible to service packs and technology levels in the future.

We will be able to use the Coherent Accelerator Processor Interface (CAPI) with AIX 7.2; up to now it has been Linux only. I expect to see more hardware taking advantage of CAPI in the future. By using CAPI, we can reduce the number of instructions needed to do I/O: instead of talking to an interface and using those drivers, we go directly from the CPU to a flash storage array, for example.

There will be two different features related to SSDs: LVM Mirroring to Flash and Server Based Flash Caching. These are the key distinctions:
  • LVM Mirroring to flash uses existing LVM mirroring capability to mirror slower spinning storage to high-speed SSD storage, and then we can specify that the SSD is the preferred mirror for reads. This implies that the SSD must have the same capacity as the spinning storage. This is implemented on both 7.1 (already available in TL3 SP4) and 7.2.
  • Server Based Flash Caching is the ability to use a smaller SSD as a cache for larger and slower spinning storage. This does not rely on LVM mirroring (the storage does not have to be mirrored). Unless you need a full mirror, this would be a more cost-effective solution than mirroring since it does not require as much SSD capacity, but it will provide a similar performance benefit. This is an AIX 7.2-only feature (at least for now).

Other Announcements

Also in these announcements, we find that the newest release of PowerKVM, v3.1, will run in little endian mode. There will be vCPU and memory hot plug support, dynamic micro threading, and SRIOV support.

On the hardware side new Linux only machines were announced:

  • S822LC for High Performance Computing is a 2-socket 2U system with two NVIDIA GPUs.
  • S822LC for Commercial Computing is a 2-socket 2U system with no GPU. It will have up to 20 cores and 1 TB memory with five PCIe slots, four of which are CAPI enabled.
  • S812LC is a 1-socket 2U system with up to 14 large form factor disk drives, which provides for 84 TB of on-board storage. This machine supports up to 10 cores and 1 TB memory with four PCIe slots, two of which are CAPI enabled.

The LC system portfolio will be different from other scale-out servers. Customers will have access to pricing and configurations and will purchase directly from the Web, although they are still welcome to engage with a business partner to help them with their machines. IBM states that it is simple to order these systems. They come with a three-year 8×5 warranty with 100 percent client replaceable parts. Six configurations are available. These systems should be available Oct. 30.

As always, IBM is committed to bringing new hardware and operating system features to its customers, and this announcement is no exception.

For more on these announcements, check out:

Jay Kruemcke’s blog “AIX 7.2 and October Power software announcements”

Recently updated IBM AIX – From Strength to Strength document

Announcement letter: IBM AIX 7.2 delivers the reliability, availability, performance, and security needed to be successful in the new global economy

A list of all of today’s announcements

Displaying Virtual Optical Device Info with lsvopt

Edit: Some links no longer work.

Originally posted September 29, 2015 on AIXchange

I have a client that works with virtual optical devices, having built one for each LPAR on its system. The client wanted to know the easiest way to display these devices along with all of the virtual media (both the media already loaded into the devices, and the media available to load).

I’ve covered this topic before (see here and here), but it’s worth revisiting.

The client created the virtual optical devices using this command:

    mkvdev -fbo -vadapter vhostX -dev somename

Media from the DVD drive was copied using this command:

    mkvopt -name cdname.iso -dev cd0 -ro

Media was verified with the lsrep command, which displays the size of the client’s virtual media repository, along with the names, sizes and access of all the .iso images (either ro or rw). (Note: I recommend monitoring the size of your own media repository, particularly if you plan on adding more media.) While similar information can be found with ls -la /var/vio/VMLibrary, lsrep seems a bit more user friendly.

On this project, I worked directly with a guy with an IBM i background and limited familiarity with the VIO server. In my experience it seems that people accustomed to IBM i tend to look for an HMC GUI method to manipulate the VIO server or some other easier way of doing things compared to messing around with UNIX command line stuff. In this instance, he was trying to avoid a couple of common uses of the lsmap command:

* lsmap -vadapter vhostX — This would require him to specify the vadapter parameter and go through the adapters one by one.

* lsmap -all | more — He didn’t want to have to scroll through all of the resulting output.

Fortunately, the lsvopt command provided the alternative to all that pain. With lsvopt, he could inventory the virtual media devices, displaying the name of each device, the media that was loaded, and the size of the loaded media.
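
As a quick illustration, using the device and image names from the commands above:

    $ lsvopt                                   # list every virtual optical device and what is loaded
    $ loadopt -disk cdname.iso -vtd somename   # load an image from the media repository into a device
    $ lsvopt                                   # the media name and size now show for that device
    $ unloadopt -vtd somename                  # unload it again when finished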

Since I mentioned it, note that lsvopt is also handy when it comes to VIO server upgrades. See this section of the release notes:

Before installing the VIOS Update Release 2.2.3.50
The update could fail if there is a loaded media repository.

Checking for a loaded media repository
To check for a loaded media repository, and then unload it, follow these steps.
    To check for loaded images, run the following command: $ lsvopt
    The Media column lists any loaded media.
    To unload media images, run the following commands on all Virtual Target Devices that have loaded images: $ unloadopt -vtd <file-backed_virtual_optical_device>
    To verify that all media are unloaded, run the following command again: $ lsvopt
    The command output should show No Media for all VTDs.

While a lot of you rely on lsmap, I still run into people who don’t know about lsvopt. Plus, a refresher never hurts.

The IBM Champion Program is Back

Edit: Now I am a Lifetime Champion. Some links no longer work.

Originally posted September 22, 2015 on AIXchange

Back in 2011 I wrote about the IBM Champion program and how happy I was to be one of those recognized. Since that time, the program went on a bit of a hiatus, and there hadn’t been any new nominations for Power Champions (my official designation). Occasionally, someone on Twitter or elsewhere online would ask about the program and when it would be revived.

I’m pleased to report that that time is now:

It’s IBM Champion Season! (nominations are open)

No, that doesn’t mean you get to hunt IBM Champions! What it means is that nominations are now open, so you can nominate IBM Champions for the following areas:

    IBM Social Business (AKA Lotus, ICS, ESS)
    IBM Power Systems
    IBM Middleware (AKA Tivoli, Rational, WebSphere)

When: From September 14 – October 31

How: https://ibm.biz/NominateChamps

The IBM Champion program recognizes innovative thought leaders in the technical community. An IBM Champion is an IT professional, business leader, or educator who influences and mentors others to help them make the best use of IBM software, solutions, and services, shares knowledge and expertise, and helps nurture and grow the community. The program recognizes participants’ contributions over the past year in a variety of ways, including conference discounts, VIP access, and logo merchandise, exclusive communities and feedback opportunities, and recognition and promotion via IBM’s social channels.

Contributions can come in a variety of forms, and popular contributions include blogging, speaking at conferences or events, moderating forums, leading user groups, and authoring books or magazines. Educators can also become IBM Champions; for example, academic faculty may become IBM Champions by including IBM products and technologies in course curricula and encouraging students to build skills and expertise in these areas.

Take the opportunity to nominate an influencer of IBM Social Business, IBM Power, or IBM Middleware, now. Nominations for the 2016 IBM Champion program will be accepted through Midnight Eastern Time, October 31st 2015.

Nominations for IBM Champion are open to candidates worldwide, and candidates can be self-nominated or nominated by another individual. IBM employees are not eligible for nomination.

Tips for a solid nomination:
* Be specific about contributions. They need to be verifiable by either a web search, or by someone at IBM who can confirm the contributions.
* It is not a popularity contest – more nominations does not necessarily boost your chances. Content of the nomination is vital.
* Include links to the nominee’s blog, if applicable for contributions.
* Include the nominee’s twitter handle, if they have one.
* Include the nominee’s email address.
* Stick to contributions for 2015. Nothing prior to that is relevant as contributions are assessed each year.

I’m excited to see the program is back, and I look forward to seeing who will soon be joining the ranks of IBM Power Champions. Who do you plan to nominate?

Sending Log Files to IBM

Edit: Still worth thinking about.

Originally posted September 15, 2015 on AIXchange

Are you sending in log files and snap files to IBM for problem analysis? I usually send my information via FTP, but lately I’ve tried other methods like HTTPS or the Java utility. For anyone who’s grown up with GUIs, these may be more appealing options.

Learn more about updating PMRs by using the Enhanced Customer Data Repository:

Enhanced Customer Data Repository (ECuRep) is a secure and fully supported data repository with problem determination tools and functions. It updates problem management records (PMR) and maintains full data life cycle management.

This video provides further information:

What follows can be found via the send data tab:

Speed of transfer
While you may send data to any of our addresses, your speed of transfer will be quickest if you choose the geographic location nearest your physical location.

Americas
The Java and z/OS utilities are fastest
The next fastest methods are FTP and FTPS
Server address: testcase.boulder.ibm.com

Asia Pacific
The Java and z/OS utilities are fastest
The next fastest methods are FTP and FTPS
Server address: ftp.ap.ecurep.ibm.com

Europe
The Java and z/OS utilities are fastest
The next fastest methods are FTP, FTPS and SFTP
Server address: ftp.ecurep.ibm.com

Use this chart to determine which method best suits your needs based on the size of the files you’re transferring. I’ve listed the information below, but believe me, it will make more sense when you consult the chart.

Available methods, by file size:

FTP
* Greater than 2 gigabytes: Yes, both regular and secure FTP methods are supported. (Faster)
* Less than 2 gigabytes: Yes, both regular and secure FTP methods are supported. (Faster)
* Less than 20 megabytes: Yes, both regular and secure FTP methods are supported.

HTTPS
* Greater than 2 gigabytes: Only when using the widget on www.secure.ecurep.ibm.com.
* Less than 2 gigabytes: Yes, both regular and secure HTTP methods are supported, but we strongly encourage a file limit of 200 megabytes when transmitting data via HTTPS.
* Less than 20 megabytes: Yes, both regular and secure HTTPS methods are supported.

Java utility
* Greater than 2 gigabytes: Yes, all data is transmitted securely using the Java utility. (Faster)
* Less than 2 gigabytes: Yes, all data is transmitted securely using the Java utility. (Faster)
* Less than 20 megabytes: Yes, all data is transmitted securely using the Java utility.

Email
* Greater than 2 gigabytes: No.
* Less than 2 gigabytes: No.
* Less than 20 megabytes: Yes, both regular and secure emails are supported.

1. Gather diagnostic data. Your IBM SSR will inform you what diagnostic data is required.
Your IBM SSR will provide you with a Problem Management Record number (PMR). Write this down.
2. Compress data. All diagnostic data delivered electronically to IBM must be in a compressed or packed format following the IBM file naming conventions.

A problem record is identified by its ID, which is built out of the PMR <xxxxx> or RCMS/CROSS number <xxxxxxx>, the branch office <bbb> (only mandatory for PMR ticket IDs), and the country code <ccc>.

File naming convention for PMR tickets: xxxxx.bbb.ccc.yyy.yyy
Example: 34123.055.724.Filename.zip (<PMR id>.<branch_office>.<country_code>.<filename>)
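
Putting the naming convention together with the FTP method, an upload has traditionally looked something like this (the PMR number, snap file and target directory are examples; confirm the current directory with your support contact):

    # rename the collected data to follow the convention, then send it
    cp /tmp/ibmsupt/snap.pax.Z /tmp/34123.055.724.snap.pax.Z
    ftp testcase.boulder.ibm.com
    # log in as anonymous, then:
    #   cd /toibm/aix
    #   binary
    #   put /tmp/34123.055.724.snap.pax.Z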

For further assistance, contact: contact@ecurep.ibm.com .

So for those of you who send in log files, have you been sticking with FTP and the command line, or have you tried another method?

A Troubleshooting Follow-up

Edit: More fun with zsnap.

Originally posted September 8, 2015 on AIXchange

Last week I wrote about the zsnap command and how it can be used to collect information and troubleshooting data for AIX, PowerHA or the VIO server. Here’s how to use zsnap with PowerHA SystemMirror:

The following procedures are for data collection, not for problem diagnosis. Gathering this information before calling IBM support can aid in problem determination and save time resolving Problem Management Records (PMRs).

Using zsnap for PowerHA SystemMirror
Run # zsnap -HACMP
This zsnap command gathers PowerHA data and creates the testcase file in one step. If you already have a PMR number, see the example below.

Data
The zsnap command for PowerHA SystemMirror gathers the same information as snap at this time. The data include:

* Data from both nodes
* CAA data (PowerHA 7.1 and up)
* RSCT information (PowerHA 6.1 and lower)
* AIX information: bootinfo, lslpp, emgr, lsdev disk data, lspv lsvg, lsfs, mount, df, lscfg, lsattr on fibre channel adapter, process table, env data
* Network information: netstat -in, netstat -rn, netstat -v, netstat -m, lsdev adapter and interface data, tty, lsattr on network adapters, ODM data for both PowerHA and AIX
* Error report
* Configuration files: clhosts, clinfo.rc, harc.net, netmon.cf, rhosts, clip_config, environment, inetd.conf, limits, profile, resolv.conf, snmpd.log, snmpdv3.log, filesystems, inittab, netprobe.log, rc.net, services, snmpd.peers, syslog.conf, clvg_config, hosts, ipHarvest.log, netsvc.conf, rc.nfs, snmpd.conf and snmpdv3.conf, ifrestrict
* AHAFS data
* PowerHA logs: autoclstrcfgmonitor.out, autoverify.log, cell temp log, clverify, clavan.log, cluster.log, clcomd.log, clcomddiag.log, clconfigassist.log, hacmp.out, clstrmgr.debug, clstrmgr.debug.long, clevents, clevmgrdevents, clinfo.log, clutils.log, clver_CA_daemon_invoke_client.log, clver_debug.log, cspoc.log, dhcpsa.log, dnssa.log, domino_server.log, emuhacmp.out, hacmprd_run_rcovcmd.debug, application monitor logs, smart assistant logs, smit.log, migration.log
* PowerHA data: hostname information, cllsif information, cluster state data, cluster daemon data, resource group information, cluster topology information

Example
See zsnap usage for all available options.
# zsnap --HACMP --pmr 12345,123,123
The example gathers the appropriate data and creates a testcase file with the IBM standard naming convention for quicker processing. You will be prompted to send the file to IBM using the FTP protocol. If you don’t have a PMR number, omit the --pmr flag to build the testcase file.
You can also run the zsnap command from the AIX SMIT menus.

Using snap for PowerHA SystemMirror
The snap command is the standard AIX tool that gathers data and stores that information in /tmp/ibmsupt/. On its own, the snap command does not gather all of the additional PowerHA-related information that zsnap collects, so some of it must be gathered manually.

Data
See zsnap Data section above for the data collected by the snap command.

Sample snap procedure for PowerHA
See snap usage for all available options.

Follow these steps to gather the PowerHA data.
1. Run the snap -r command to remove all previously gathered data on all of the nodes in the cluster.
2. Gather the additional information and put it in /tmp/ibmsupt/testcase. You may need to recreate the testcase directory.
3. Run # snap -e on just one node.
4. Rename the testcase file to adhere to IBM testcase file naming conventions, and then send the file to IBM.
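
Put together, the four steps above might look roughly like this on the command line (a sketch only; the extra file and the PMR-style name are hypothetical):

    # step 1: on every node in the cluster, clear out previously gathered snap data
    snap -r

    # step 2: recreate the testcase directory and add any extra data IBM asked for
    mkdir -p /tmp/ibmsupt/testcase
    cp /tmp/extra_diag_output.txt /tmp/ibmsupt/testcase/

    # step 3: on one node only, gather the PowerHA data
    snap -e

    # step 4: package it, rename it to the IBM naming convention, and send it to IBM
    snap -c    # creates /tmp/ibmsupt/snap.pax.Z
    mv /tmp/ibmsupt/snap.pax.Z /tmp/ibmsupt/34123.055.724.snap.pax.Z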

Although IBM Support will guide you through the process of collecting and sending data, it’s best to be proactive. You’ll generally resolve the issue more quickly if you do your own troubleshooting.

The First Step in Troubleshooting

Edit: Do you use snap or zsnap more often?

Originally posted September 1, 2015 on AIXchange

If you work on AIX (which you surely do if you’re reading this) and you’ve worked with IBM Support, you’ve probably used the snap command.

But are you familiar with the zsnap command?

The zsnap command is a supplemental tool used by AIX support personnel to gather debugging data. Built around the standard AIX snap command, the zsnap command gathers additional information that the snap command does not provide. You can also use the zsnap command to send a testcase directly to IBM from the machine that generated the testcase data. If needed, the zsnap command can fork multiple calls to the snap command, which results in quicker data gathering than if done via snap.

IBM has a web page that walks you through the troubleshooting process and also demonstrates the many uses of the zsnap command. This page brings you to the index:

    MustGather index
    Cluster Aware AIX problems [CAA]
    Filesystems
    JAVA on AIX
    Installation problems
    Logical Volume Manager problems
    NFS specific problems
    NIM problems
    PowerHA (HACMP) problems
    PowerVM Virtual I/O Server problems
    SAN or device I/O problems
    System crash
    TCP/IP problems

Each entry links to different procedures for gathering information. The MustGather index is a nice place to start if you’re unsure which zsnap options you should use, but all the links display different methods for using zsnap to collect information.

For example, the first entry with CAA issues states:

Using zsnap for CAA
Run # zsnap --CAA
This zsnap command gathers CAA data and creates the testcase file in one step. If you already have a PMR number, see the example below.

Data
In addition to the information gathered by the snap command, the zsnap command gathers CAA data that include:

    bootstrap repository information
    detailed repository disk data
    CAA tunables data
    lscluster -i, -c, -s, -d, -m
    uname system information
    swinfo information
    CAA syslog log

Example

See zsnap usage for all available options.
# zsnap --CAA --pmr 12345,123,123

The example gathers the appropriate data and creates a testcase file with the IBM standard naming convention for quicker processing. You will be prompted to send the file to IBM using the FTP protocol. If you don’t have a PMR number, omit the --pmr flag to build the testcase file.
You can also run the zsnap command from the AIX SMIT menus.

Using snap for CAA
The snap command is the standard AIX tool that gathers data and stores that information in /tmp/ibmsupt/. There are two flags that can be used to gather CAA data with snap: snap caa or snap -e.

Data
To reduce the possibility of needing to request additional information later, the following information needs to be gathered manually and included in the snap testcase file.
See zsnap Data section above for the information you need to collect.

Sample snap procedure for CAA
See snap usage for all available options.
Follow these steps to gather the CAA data.
1. Run the snap -r command to remove all previously gathered data.
2. Gather the additional information and put it in /tmp/ibmsupt/testcase. You may need to recreate the testcase directory.
3. Run # snap caa or snap -e
4. Rename the testcase file to adhere to IBM testcase file naming conventions, and then send the file to IBM.

Here are some specific zsnap commands you can use:

    For filesystems: zsnap --FS
    For installation issues: zsnap --INSTALL, or zsnap --NIM
    For LVM issues: zsnap --LVM

Each link gives you the data that is captured and examples for using the command. For completeness there is also:

    zsnap --SAN
    zsnap --NFS
    zsnap --DUMP
    zsnap --TCPIP

As you can see, zsnap is a valuable tool that can help you before you take your problem to IBM Support.

Helpful Links About Event Monitoring

Edit: Still an interesting concept.

Originally posted August 25, 2015 on AIXchange

On Twitter, Chris Gibson linked to this interesting post from Andrey Klyachin:

A colleague asked me, if there is an interface in AIX like inotify in Linux. He has some problem on one of his AIX boxes and wanted to monitor new files in a directory. Of course there is such interface since AIX 6.1 TL6 or AIX 7.1 – it is AHAFS. Not very well known AIX feature, used primarily by new PowerHA 7.1, but not by admins.

If you want to know more about the feature, I would suggest you first to read the IBM documentation. My example is just small practical example how to use the technology, not a manual about it.

The IBM documentation to which Andrey refers brings you to the Introduction to the AIX Event Infrastructure:

The AIX Event Infrastructure is an event monitoring framework for monitoring predefined and user-defined events.

In the AIX Event Infrastructure, an event is defined as any change of a state or a value that can be detected by the kernel or a kernel extension at the time the change occurs. The events that can be monitored are represented as files in a pseudo file system. Some advantages of the AIX Event infrastructure are:

  • There is no need for constant polling. Users monitoring the events are notified when those events occur.
  • Detailed information about an event (such as stack trace and user and process information) is provided to the user monitoring the event.
  • Existing file system interfaces are used so that there is no need for a new application programming interface (API).
  • Control is handed to the AIX Event Infrastructure at the exact time the event occurs.

Further in the documentation, we come to the infrastructure components:

The AIX Event Infrastructure is made up of the following four components:

  • The kernel extension implementing the pseudo file system.
  • The event consumers that consume the events.
  • The event producers that produce events.
  • The kernel component that serves as an interface between the kernel extension and the event producers.

From there, the doc covers setting up the Event infrastructure (which is basically installing bos.ahafs, creating the directory, and mounting it).

The high level view of how the AIX Event Infrastructure works says:

A consumer may monitor multiple events, and multiple consumers may monitor the same event. Each consumer may monitor value-based events with a different threshold value. To handle this, the AIX® Event Infrastructure kernel extension keeps a list of each consumer’s information including:

  • Specified wait type (WAIT_IN_READ or WAIT_IN_SELECT)
  • Level of information requested
  • Threshold(s) for which to monitor (if monitoring a threshold value event)
  • A buffer used to hold information about event occurrences.

Event information is stored per-process so that different processes monitoring the same event do not alter the event data. When a consumer process reads from a monitor file, it will only read its own copy of the event data.

Finally, the monitoring events section offers subsections on creating the monitor file, writing to the monitor file, reading event data, and more.

Relevant to the documentation is this typical workflow.

Now back to Andrey’s post. He’s written a perl script that notifies him when, for instance, someone changes the /root/smit.log file:

The procedure to create a new monitor is relatively simple. We have to create a new directory and to make a new .mon-file in the directory. In the file we write how much information do we need and some other flags. After that we read from the file, when a notification comes.

Let’s say we want to monitor file /root/smit.log and obtain a notification every time it is changed. We go to directory /aha/fs/modFile.monFactory – it is a standard directory for “File modification monitor”, and create a directory root there with mkdir command. Then we create smit.log.mon file in this directory and write CHANGED=YES;INFO_LVL=1 in this file. That’s it! After that the only thing we have to do is to wait, till some information comes.
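
Here is a rough ksh93 sketch of that procedure. It assumes bos.ahafs is installed, that /aha is mounted, and (as I understand the AHAFS protocol) that the monitoring string is written and the blocking read performed on the same open file descriptor; Andrey’s Perl script remains the better-tested way to do this:

    # mount the AHAFS pseudo file system if it isn't already
    mkdir /aha 2>/dev/null
    mount -v ahafs /aha /aha

    # create the monitor directory and .mon file for /root/smit.log
    mkdir -p /aha/fs/modFile.monFactory/root

    # open the monitor file read/write, write the spec, then block until an event arrives
    exec 3<> /aha/fs/modFile.monFactory/root/smit.log.mon
    print -u3 "CHANGED=YES;INFO_LVL=1"
    read -u3 line && print "smit.log changed: $line"
    exec 3<&-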

And to think I found all this from a single tweet.

Check Out IBM Software System Maps

Edit: I still use these all the time

Originally posted August 18, 2015 on AIXchange

Say your site is getting new hardware. One thing you’d want to know is the software versions you should be running on your shiny new boxes.

That’s what makes IBM’s Software System Maps web page worth bookmarking. Here you’ll find software maps for AIX, IBM i, PowerVM VIO servers, SUSE Linux and RedHat Linux. There’s also a link for supported code combinations for HMC and server firmware.

When you select the AIX map, you’re brought to a list of Power systems. Pick a model and a machine type, and you’ll have a choice of configurations, whether you’re looking at virtual clients or clients that have access to physical I/O cards. For instance, when I selected the 8284-22A (S822) and all I/O, I found out which AIX versions were supported and at what levels. (AIX 7100-01-10, 7100-02-05 and 7100-03-03 are the supported base levels, but 7100-01-10, 7100-02-06 and 7100-03-04 are recommended. The same information is provided for AIX 6.1; however, AIX 5.3, AIX 5.2, AIX 5.1, AIX 4.3.3 and AIX 3.2.5 aren’t supported on this hardware.) The bottom of this page contains links to the fix level recommendation tool, Fix Central, end of support dates for AIX, etc.

I should stress that you don’t need brand new hardware to make use of this tool. The AIX map supports older systems, including RS/6000s running AIX 4.3.3, AIX 5.1, etc. So if you’re still using those systems (perhaps you’re running some sort of technology museum?), you too can benefit from this capability.

And, as I mentioned, you can also do VIO server software mapping. Just select a system and find the VIO server versions that are verified, and the versions that are recommended. You can also see the versions that haven’t been verified to run on the hardware you’re interested in.

Clicking on the supported code combinations link brings you to the POWER code matrix page:

System Firmware is delivered as a Release Level or a Service Pack. Release Levels support the general availability (GA) of new function or features, and new machine types or models. Upgrading to a higher Release Level is disruptive to customer operations. IBM intends to introduce no more than two new Release Levels per year. These Release Levels will be supported by Service Packs. Service Packs are intended to contain only firmware fixes and not to introduce new function. A Service Pack is an update to an existing Release Level.

Note: Installing a Release Level is also referred to as upgrading your firmware. Installing a Service Pack is referred to as updating your firmware. For HMC-managed systems at or beyond System Firmware Release Level 230 (available May 2005), Service Pack updates can be concurrently installed. Concurrent installation minimizes or eliminates downtime needed to apply firmware patches. IBM cannot guarantee that all Service Packs can be installed concurrently, however, our goal is to provide non-disruptive installation of Service Packs.

Browse around the site. It’s kept up to date and has good reference material.

Creating Adapters with the HMC Enhanced GUI

Edit: Sometimes I miss the old interface.

Originally posted August 11, 2015 on AIXchange

I was recently playing around with the enhanced HMC GUI, using the new interface to look at an old test machine.

The test box had crash and burn LPARs that had been created over time. In some cases, I’d spin up a test LPAR and select the VIO server option that allowed for any client partition to connect to the virtual adapter I’d created. This was to allow greater flexibility going forward — it wouldn’t be necessary to re-create the adapter; I’d just assign another client LPAR to the existing one. If I hadn’t yet built the client LPAR definitions, I’d set up a bunch of server adapters ahead of time for later use with the crash and burn client LPARs.

On the new HMC software version, when I selected the manage PowerVM option, some of the disks and adapters weren’t appearing in the PowerVM Virtual storage adapter view. Since I could see them using the classic HMC view, I figured it was a bug and opened a ticket with IBM Support.

After some back and forth, support sent me this interesting information:

Server Adapters in HMC can be created with the option of “Any” for the Client Adapter. Such adapters are not supported by REST or by the Enhanced+ GUI. This is by design. It is not possible to know to which client adapter it is connected to. The Server adapter mapping could possibly change during the reboot of the logical partition. The REST and Enhanced+ GUI do not provide the option of creating a Server adapter with the “Any” option. The usage of “Any” is not recommended when creating Server Adapters, though it’s possible in Classic GUI.

That’s right. Adapters set to “Any” won’t display in the enhanced GUI option.

This explanation made sense once I thought about it, but since it took me awhile to get this answer, hopefully I can save you some time and trouble by passing it along here. Then again, hopefully you aren’t creating server adapters without assigning them to clients in the first place, which would save you from ever having to deal with this issue at all. Going forward I know I’ll be more careful when assigning virtual adapters on my test machines.

The pdump Script

Edit: Do you know about this now?

Originally posted August 4, 2015 on AIXchange

Do you have a hung process on your AIX machine? Do you need more information about a running process? These are just two instances where the pdump script could help you:

The pdump script extracts information from the running process by using the kdb command and other AIX tools. This script can be especially helpful if you suspect the process is in a hung state or is suspected to be in an infinite loop.

The pdump.sh data gathering process includes:

  • svmon
  • proctree
  • thread information
  • user area information
  • lock information
  • current stack information

In order to use the script
Step 1. Determine what process is hung.

If you suspect a process is hung, first find its <pid#> using the ps command. Then, using proctree <pid#>, check if that process has child processes that it might be waiting on. If the parent process is waiting on a child process, then you should first try running pdump.sh on the last child found in the proctree.

Step 2. Run pdump on the process.
pdump.sh <pid#>
Where <pid#> is the process id that is suspected to be hung or looping.

If you cannot determine which specific process is hung, you may simply run pdump.sh against PID 1 (the init process) as a start point for investigation:
    pdump.sh 1

Tips
It is often helpful to run two pdumps on the same process at 60 second intervals. This will allow IBM AIX support center representatives to verify if that process made any progress in that time frame. Capture this information and include it in the test case you upload to IBM.

Try running pdump with only the -l flag (long mode) unless instructed by your support representative to do otherwise. The -d flag (call dbx) might fail to attach to the process when it is hung in kernel mode.
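
A quick sketch of that tip, assuming pdump.sh is in your PATH, that the -l flag goes before the PID (per the usage above), and that 4321 is the hypothetical PID of the hung process:

    # capture two long-mode pdumps 60 seconds apart so support can compare them
    pdump.sh -l 4321
    sleep 60
    pdump.sh -l 4321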

You can copy the script from here. Change the permissions to 700 before running it for the first time.

Have you tried this tool? Were you even aware of it?

Identifying SAN Devices

Edit: Still good stuff.

Originally posted July 28, 2015 on AIXchange

Anthony English recently tweeted about world wide port names (WWPNs), linking to this series of slides last updated by Anthony Vandewerdt in 2013. When zoning storage devices and servers on a SAN, it’s important to identify every piece of hardware. For those who work with IBM storage devices, determining the WWPN ranges used by each storage model is much simpler, thanks to the IBM Storage WWPN Determination guide. Vandewerdt’s two-year-old slides are version 6.6 of the guide. When version 6.5 came out in 2012, he posted this explanation:

If this guide is new to you, its purpose is to let you take a WWPN and decode it so you can work out not only which type of storage that WWPN came from, but the actual port on that storage. People doing implementation services, problem determination, storage zoning and day-to-day configuration maintenance will get a lot of use out of this document. If you think there is an area that could be improved or products you would like added, please let me know.

It is also important to point out that IBM Storage uses persistent WWPN, which means if a host adapter in an IBM Storage device has to be replaced, it will always present the same WWPNs as the old adapter. This means no changes to zoning are needed after a hardware failure.

The document starts by defining WWPNs and world wide node names (WWNNs). It then lists the WWNN/WWPN ranges used by IBM products:

A WWNN is a World Wide Node Name; used to uniquely identify a device in a Storage Area Network (SAN). Each IBM Storage device has its own unique WWNN. For DS8000, each Storage Facility Image (SFI) has a unique WWNN. For SVC and Storwize V7000, each Node has a unique WWNN. A WWPN is a World Wide Port Name; a unique identifier for each Fibre Channel port presented to a Storage Area Network (SAN). Each port on an IBM Storage Device has a unique and persistent WWPN.
 
     – IBM System Storage devices use persistent WWPN. This means if an HA (Host Adapter) in an IBM System Storage Device gets replaced, the new HA will present the same WWPN as the old HA. IBM Storage uses a methodology whereby each WWPN is a child of the WWNN. This means that if you know the WWPN of a port, you can easily match it to the WWNN of the storage device that owns that port.
 
     – A WWPN is always 16 hexadecimal characters long. This is actually 8 bytes. Three of these bytes are used for the vendor ID. The position of the vendor ID within the WWPN varies based on the format ID of the WWPN. To determine more information we actually use the first character of the WWPN to see which format it is… .
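
As a small illustration of that decoding, here is a sketch that pulls the format ID and vendor OUI out of a WWPN, assuming the common NAA 5 (IEEE Registered) layout in which the 24-bit vendor ID immediately follows the 4-bit format field; other formats place the OUI elsewhere, which is exactly what the guide walks you through:

    wwpn=500507680230e835            # sample WWPN from an IBM SVC/Storwize port
    fmt=$(echo $wwpn | cut -c1)      # first hex digit is the format (NAA) ID
    oui=$(echo $wwpn | cut -c2-7)    # for format 5, the next six digits are the vendor OUI
    echo "Format: $fmt  Vendor OUI: $oui"
    # 005076 is registered to IBM, so this port belongs to an IBM storage device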

Vandewerdt also links to this list of companies that are registered with IEEE.

Share Your Product Ideas with IBM

Edit: Some links no longer work.

Originally posted July 21, 2015 on AIXchange

What new features and capabilities would you like to see added to AIX? How can you share your ideas with IBM?

In the past, customers could submit a design change request (DCR). This is now done with a request for enhancement (RFE).

Read more about RFEs here:

The following products are now available on the IBM RFE Community. This RFE Community update gives you the ability to enter additional Requests for Enhancements (RFEs), allowing for better communication between you and developers on more platforms and servers.

  • IBM AIX: The AIX operating system is an open standards-based, UNIX operating system that allows you to run the applications you want on Power Systems servers.
  • PowerHA: PowerHA SystemMirror for AIX technology is a high availability clustering solution for data center and multisite resiliency. It is designed to protect business applications from outages of virtually any kind, helping ensure round-the-clock business operations.
  • PowerSC: IBM PowerSC provides a security and compliance solution optimized for virtualized environments on Power Systems servers running the AIX operating system.
  • PowerVM VIOS: PowerVM provides a secure and scalable server virtualization environment for AIX and Linux applications built upon the advanced RAS features and leading performance of the Power Systems platform.

For details, check out these RFE FAQs and this list of status values and definitions:

The status of a request depends on:

  • Where the request is in our development lifecycle
  • Whether we are still considering the request
  • Whether we have approved it and plan to deliver it
  • Whether we have declined it.

Finally, here are a couple of videos. This roughly 8 minute video tells you how to watch for and receive RFE notifications. This longer video (it’s about 20 minutes) tells you how to submit, view and send out notifications on RFEs.

If you have an idea for enhancing AIX or any IBM product, or if you just want to discover what other users have suggested, why not engage in the process?

On a personal note, July 16 was the 8-year anniversary of AIXchange.

Over the years I’ve enjoyed hearing from the many readers who’ve told me that this feature has been educational or otherwise beneficial. Some of these readers have become good friends.

Through eight years, I’ve written in the neighborhood of 400 blog posts. Occasionally I’ll Google a term and be directed to something I wrote some years back, something I’d forgotten about. Besides jogging my memory, this often serves as reference material for topics I’m currently working on.

Although technology changes, I find that there’s still a wide audience for AIX- and Linux-oriented information, and I plan to continue to provide this into the future.

As always, if you have topics you would like to see covered, just drop me a line.

HMC Connectivity Security

Edit: Link still works.

Originally posted July 14, 2015 on AIXchange

This white paper, published in April, examines HMC 830 connectivity security:

This document describes data that is exchanged between the Hardware Management Console (HMC) and the IBM Service Delivery Center (SDC). In addition it also covers the methods and protocols for this exchange. This includes the configuration of “Call Home” (Electronic Service Agent) on the HMC for automatic hardware error reporting. All the functionality that is described herein refers to Power Systems HMC version V6.1.0 and later as well as the HMC used for the IBM Storage System DS8000.

The document covers HMC connectivity methods, with the caveat that “starting in 2015, new products will no longer have outbound VPN connectivity capabilities.”

Before the HMC tries to connect to the IBM servers, it first establishes an encrypted VPN tunnel between the HMC and the IBM VPN server gateway. The HMC initiates this tunnel using Encapsulated Security Payload (ESP, Protocol 50) and User Datagram Protocol (UDP).  After it is established, all further communications are handled through TCP sockets, which always originate from the HMC.

For the HMC to communicate successfully, the client’s external firewall must allow traffic for protocol ESP and port 500 UDP to flow freely in both directions. The use of SNAT and masquerading rules to mask the HMC’s source IP address are both acceptable, but port 4500 UDP must be open in both directions instead of protocol ESP. The firewall may also limit the specific IP addresses to which the HMC can connect.
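
As a purely illustrative sketch of those firewall requirements (Linux iptables syntax is used here only as an example; your firewall’s configuration language will differ):

    # allow IKE negotiation and ESP between the HMC and the IBM VPN gateways
    iptables -A FORWARD -p udp --dport 500 -j ACCEPT
    iptables -A FORWARD -p udp --sport 500 -j ACCEPT
    iptables -A FORWARD -p esp -j ACCEPT
    # if the HMC is behind SNAT/masquerading, allow NAT traversal on UDP 4500 instead of ESP
    iptables -A FORWARD -p udp --dport 4500 -j ACCEPT
    iptables -A FORWARD -p udp --sport 4500 -j ACCEPT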

Although modem connectivity is still supported for some systems, its use is being deprecated and the support has been removed from POWER8. IBM recommends the usage of internet connectivity for faster service, due to the size of error data files that may be sent to IBM Support. …

Configuring the Electronic Service Agent tool on your HMC enables outbound communications to IBM Support only. Electronic Service Agent is secure, and does not allow inbound connectivity. However, HMC can configure customer controlled inbound communications. Inbound connectivity configurations allow an IBM Service Representative to connect from IBM directly to your HMC or the systems that the HMC manages. The following sections describe two different approaches to remote service. Both approaches allow only a one time use after enabling.

Reasons for connecting to IBM
* Reporting a problem with the HMC or one of the systems it is managing back to IBM
* Downloading fixes for systems the HMC manages (Power HMC only)
* Reporting inventory and system configuration information back to IBM
* Sending extended error data for analysis by IBM
* Closing out a problem that was previously open
* Reporting heartbeat and status of monitored systems
* Sending performance and utilization data for system I/O, network, memory, and processors (Power HMC only)
* Transmission of live partition mobility (LPM) data (Power HMC only)
* Track maintenance statistics (Power HMC)
* Transmission of deconfigured resources (Power HMC only).

In addition, there’s a list of the data that is sent to IBM, including filenames and the information they contain:

When Electronic Service Agent on the HMC opens up a problem report for itself, or one the systems that it manages, that report is called home to IBM. All the information in that report gets stored for up to 60 days after the closure of the problem. Problem data that is associated with that problem report is also called home and stored. That information and any other associated packages will be stored for up to three days and then deleted automatically. Support Engineers who are actively working on a problem may offload the data for debugging purposes and then delete it when finished. Hardware inventory reports and other various performance and utilization data may be stored for many years.

There are also sections that cover multiple HMCs and the IP addresses and ports that IBM uses for connectivity.

As always I recommend that you take the time to read the whole document.

A Tool for SAN Troubleshooting

Edit: Still good stuff.

Originally posted July 7, 2015 on AIXchange

Are you looking for more information about your SAN? Do you want to learn about the LUNs that have been presented to your host? Maybe you want to be able to compare what your machine sees now as opposed to what it was seeing on the SAN.

IBM has a SAN troubleshooting tool that can help you. It’s called devscan:

The purpose of devscan is to make debugging storage problems faster and easier. Devscan does this by rapidly gathering a great deal of information about the Storage Area Network (SAN) and displaying it in an easy to understand manner. Devscan can be run from any AIX host, including VIO clients, or from a VIOS.

The information devscan displays is gathered from the SAN itself or the device driver, not from ODM, with exceptions described in the man page. The data is therefore guaranteed to be current and correct.

In the default case, devscan is unable to change any state on the SAN or on the host, making it safe to run even in production environments. In all cases, devscan is safer to run than cfgmgr, because it cannot change the ODM. Some of the optional commands devscan can use are able to cause a state change on the SAN. Details are provided in the man page.

Devscan can report a list of all available target devices and LUNs
For each LUN, devscan can report
· ODM name and status
· PVID, if there is one
· Device type
· Capacity and block size
· SCSI status
· Reservation status, both SCSI-2 and SCSI-3
· ALUA status
· Time to service a SCSI Read

Devscan scans a set of SCSI adapters, and then issues a set of commands to a set of targets and LUNs on those adapters. In the default case, devscan finds every Fibre Channel, SAS, iSCSI, and VSCSI adapter in the system and traverses each one. It issues SCSI Report LUNs and Inquiry commands to every target and LUN it finds. The set of adapters to be scanned, targets and LUNs to be traversed, and commands to be issued may be controlled with several of the optional flags.

Usage examples
1. To run against all SCSI adapters with the default command set (Start, Report LUNs, and Inquiry):
    devscan
2. To run against only the fscsi3 adapter and gather SCSI Status from all attached devices:
    devscan -c7 --dev=fscsi3
3. To determine what the NPIV client using WWPN C0507601A673002A can see through all Fibre Channel adapters on the VIOS (e.g., because the client cannot boot):
    devscan -t f -n C0507601A673002A
4. To run devscan in machine-parseable mode using “::” as the field delimiter:
    devscan --concise --delim="::"
5. To run devscan against only the VSCSI adapters in the system and write the output to /tmp/vscsi_scan_results:
    devscan -tv -o /tmp/vscsi_scan_results
6. To scan only the storage port 5001738000330193:
    echo "f|||5001738000330193" | devscan --whitelist=-
7. To scan only the storage at SCSI ID 0x010400:
    echo "f|010400" | devscan --whitelist=-
8. To scan only for hdisk15:
    echo "hdisk15" | devscan --whitelist=-
9. To scan for all targets except the one with WWNN 5001738000330000:
    echo "f||||5001738000330000" | devscan --blacklist=-
10. To scan for an iSCSI target at 192.168.3.147:
    echo "192.168.3.147" | devscan --iscsitargets=-
11. To check the SCSI status of hdisk71 on all the Fibre adapters in the system and send the output to /tmp/devscan.out:
    echo "hdisk71" | devscan --whitelist=- -o /tmp/devscan.out -tf -c7 -F

Here are a few examples of devscan output and how to interpret them:

1. Processing FC device:
    Adapter driver: fcs4
    Protocol driver: fscsi4
    Connection type: none
    Local SCSI ID: 0x000000
    Device ID: df1000fe
    Microcode level: 271102

The connection type of “none” indicates this adapter has never had a link.
2. Processing FC device:
    Adapter driver: fcs0
    Protocol driver: fscsi0
    Connection type: fabric
    Link State: down
    Current link speed: 4 Gbps
    Local SCSI ID: 0x180600
    Device ID: 77102224
    Microcode level: 0125040024

The link state of “down” indicates this adapter had a link up since the last time it was configured, but does not currently.
3. Nameserver query succeeded, but indicated no targets are available on the SAN. This means the adapter’s link to the switch is good, but no storage is available, typically because the storage has unexpectedly left the SAN or because it was not zoned to this host port.

4. Processing iSCSI device:
    Protocol driver: iscsi0

    No targets found
    Elapsed time this adapter: 0.001358 seconds

For non-Fibre Channel devices, there is no name server, so the no-targets condition looks like this.

5. 00000000001f7d00 0000000000000000
    START failed with errno ECONNREFUSED

Devscan is able to reach this device, so the host is connected to the SAN and the nameserver is reporting it, but we are not able to log in to the device. This is an end device problem.

6. Vendor ID: IBM Device ID: 2107900 Rev: 5.90 NACA: yes
PDQ: Not connected PDT: Unknown or no device
Dynamic Tracking Enabled
TUR SCSI status:

Check Condition (sense key: ABORTED_COMMAND;
ASCQ: LOGICAL UNIT NOT SUPPORTED)
ALUA-capable device
Report LUNs failed with errno ENXIO
Extended Inquiry failed with errno ETIMEDOUT
Test Unit Ready failed with errno EIO

Other usage examples can be found on the website. Download devscan and follow these installation instructions:

1. Download the package to your machine.
2. Uncompress and extract the archive. The binary and man page are placed in /usr/local/bin and /usr/share/man/man1/, respectively, and are ready for use.

Here’s some of the output that I saw on a test machine:

    Running on host: vio1

    <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    Processing FC device:
        Adapter driver: fcs0
        Protocol driver: fscsi0
        Connection type: fabric
        Link State: up
        Local SCSI ID: 0x010000
        Local WWPN: 0x10000090fa535192
        Local WWNN: 0x20000090fa535192
        Device ID: 0xdf1000e21410f103
        Microcode level: 00010000020025200009

    SCSI ID LUN ID           WWPN             WWNN
    ———————————————————–
    0a0600  0000000000000000 500507680230e835 500507680200e835
        Vendor ID: IBM          Device ID: 2145     Rev: 0000 NACA: yes
        PDQ: Connected          PDT: Block (Disc)
        Name:          hdisk14  Path:            0  VG:       None found
        Device already SCIOLSTARTed    Dynamic Tracking Enabled
        Status: Enabled
        ALUA-capable device

    0a0600  0001000000000000 500507680230e835 500507680200e835
        Vendor ID: IBM          Device ID: 2145     Rev: 0000 NACA: yes
        PDQ: Connected          PDT: Block (Disc)
        Name:          hdisk15  Path:            0  VG:       None found
        Device already SCIOLSTARTed    Dynamic Tracking Enabled
        Status: Enabled
        ALUA-capable device

    0a0600  0002000000000000 500507680230e835 500507680200e835
        Vendor ID: IBM          Device ID: 2145     Rev: 0000 NACA: yes
        PDQ: Connected          PDT: Block (Disc)
        Name:          hdisk16  Path:            0  VG:    caavg_private
        Device already SCIOLSTARTed    Dynamic Tracking Enabled
        Status: Enabled
        ALUA-capable device

    2 targets found, reporting 20 LUNs,
    20 of which responded to SCIOLSTART.
    Elapsed time this adapter: 00.391183 seconds

Did you know this tool existed? Have you used it? What did you think?

A Message Worth Repeating

Edit: Some links no longer work.

Originally posted June 30, 2015 on AIXchange

I needed a bigger vehicle. Given my work with Boy Scouts, I spend considerable time on the road, hauling boys and their camping gear.

I wanted something big enough to comfortably contain eight or nine people and capably transport a trailer of supplies. I also wanted something that was reliable, well-maintained and a good value. I knew it’d take some time. Decently priced used vehicles like that don’t become available every day, and they typically sell within hours of being advertised. But after a few months of searching craigslist, I found a Chevy Suburban that fit my needs and requirements.

Many people are understandably apprehensive about car-buying. How do you know you’re getting what’s being advertised? There are options, services like Carfax that allow you to see different aspects of a used vehicle’s history. A few individual owners may also keep documentation of the work done to their vehicles. Even if you’re not a car person, you can have a trusted mechanic examine a used vehicle for you. While all of this is reassuring, you must still be alert for individual sellers or dealerships that may try to pawn off their problem by concealing costly issues. I certainly have no desire to own a high-mileage, poorly maintained used car.

The same principle applies to our computer hardware. This is something I’ve discussed often over the years. Of course I’m hardly the only one. Anthony English explores this issue in this recent posting. He even uses a car maintenance analogy in the second paragraph. His point? We must maintain our hardware and software.

Only a few weeks ago I wrote about the need to keep current on firmware and OS patches. Customers shouldn’t skip these updates or miss out on other enhancements that are made available on a regular basis.

In short, it pays to be proactive. Plus, it’s fairly simple now. By keeping up with AIX patches, as newer generations of POWER hardware come out, much of the enablement is already loaded into your operating system. Upgrading to new hardware can be as easy as performing a live partition mobility operation to migrate to your new equipment.

Related resources: David Tansley discusses log file maintenance in this article. I wrote about the value of keeping IBM hardware maintenance contracts on your gear here.

I’m sure we can all agree that, like our cars, our machines require regular hardware and software service.

By the way, so far so good with my Suburban. I took care of some minor issues after I got it, but I’ve already had it on a few campouts and it’s been great. You can bet that I plan to continue maintaining it to protect my investment, drive the boys back and forth to campouts safely, and enjoy it for years to come. Likewise, our machines are worth the same effort and investment.

A Look at HMC 8.2 (and Beyond)

Edit: Some links no longer work.

Originally posted June 23, 2015 on AIXchange

When you upgrade your HMC to Version 8.2, there’s a new “tech preview” option that’s intended to give you a feel for the direction that the HMC interface is heading. One of the big complaints heard from those new to POWER hardware and VIOS is how complicated it can all be to learn. I’m seeing a real effort being made to add greater functionality to the GUI so non-legacy POWER users can more easily adopt the platform. 

Two webinars have been held on this topic. Here is the Nigel Griffiths presentation (check out the slides and watch the replay); and here’s one from IBM developerWorks (slides and replay). 

The following information is borrowed from the slides. Watch both presentations to get a better feel for the technology. 

This new code runs best on HMC hardware CR6 or later. As for memory, you can try to get by with 4 GB, but 8 GB is recommended (the CR7 and CR8 start at 8 GB).

You can only get this to work with POWER6, POWER7 or POWER8 servers (no POWER5). 

What is a Technical Preview? It is there for:           

• Evaluation purposes           

• Technical familiarity           

• Learning and feedback 

Can you use it in production? Is it “supported”?

Answer : Yes and No 

Not allowed to raise a PMR           

• The PMR response would be “use the Classic version”           

• If you can reproduce the issue there, raise a PMR           

• But you can get support via the Forum           

– http://tinyurl.com/HMC8-Tech-Preview-Forum           

• DeveloperWorks pages: Feedback on the Enhanced+ Tech Preview user interface

• Developers seem to check for questions daily. 

The charts include some nice slides that show the options available in the various versions: classic, enhanced, or tech preview. I’ve used each version, though I keep going back to the classic version as that’s what I’m most familiar with. Still, knowing where things are headed, I’m making the effort to try the new code.

There’s a slide that shows the new HMC learning curve:

1. Oh heck! What the dickens is this about. I can’t do this now!           

2. Oh nuts! I can’t find any thing!           

3. Oh! Darn . . . it’s got to be here somewhere!           

4. Oooo! That was cool!           

5. I wonder what that button does? Wow!           

6. Hey, I seem to be getting the hang of this now!           

7. Yep! This is workable.           

8. I have 5 minutes. Let’s try something I have never done before.           

9. When they get this working a bit faster – I will use it. 

For what it’s worth, I think I’m somewhere between 3 and 5. How about you? Do you have the new code loaded? Are you using it? Where are you on this scale?

A Docker Primer

Edit: Some links no longer work.

Originally posted June 16, 2015 on AIXchange

Lately I’ve been reading about Docker, and it seems to keep coming up everywhere I look. If you haven’t heard of it, it’s an “open platform for developers and sysadmins to build, ship, and run distributed applications.” Here’s more from Docker’s website:

Why do developers like it?    

“With Docker, developers can build any app in any language using any toolchain.   “Dockerized” apps are completely portable and can run anywhere — colleagues’ OS X and Windows laptops, QA servers running Ubuntu in the cloud, and production data center VMs running Red Hat.” 

Of course I must take this with a grain of salt, since Docker doesn’t support x86 applications running on POWER8 systems or vice versa. Still, it’s an interesting concept, and Docker does fully support running POWER8 applications on other POWER8 systems — all you need is an OS that has Docker installed. I’ll get to that in a bit.

Why do sysadmins like it?    

“Sysadmins use Docker to provide standardized environments for their development, QA, and production teams, reducing “works on my machine” finger-pointing. By “Dockerizing” the app platform and its dependencies, sysadmins abstract away differences in OS distributions and underlying infrastructure.”     

How is this different from Virtual Machines?    

“Each virtualized application includes not only the application – which may be only 10s of MB – and the necessary binaries and libraries, but also an entire guest operating system – which may weigh 10s of GB.”     

Docker    

“The Docker Engine container comprises just the application and its dependencies. It runs as an isolated process in userspace on the host operating system, sharing the kernel with other containers. Thus, it enjoys the resource isolation and allocation benefits of VMs but is much more portable and efficient.” 

To learn more about Docker, I turned to Twitter and found this link to five videos. While some are rather long, they’re all informative:     

“When you’re interested in learning a new technology, sometimes the best way is to watch it in action—or at the very least, to have someone explain it one-on-one. Unfortunately, we don’t all have a personal technology coach for every new thing out there, so we turn to the next best thing: a great video.     

First up, if you’ve only got five minutes (well, technically seven and a half), watch this. At Opensource.com’s lightning talk series last fall, Docker contributor Vincent Batts of Red Hat gave a great overview of what Docker is, what containers are, and how these technologies are changing the way system administrators and developers are working together to deploy applications in a modern datacenter.     

Now, you understand the concept, so let’s take a slightly deeper dive. Docker founder and CTO Solomon Hykes takes you beyond the basics of containers and into how Docker works, what problems it solves, and some real-world demos.” 

I’ve found that it’s simple to get Docker working on Power systems using Ubuntu 15.04. Once I had Ubuntu running on my system (a trivial process consisting of downloading the .iso to my virtual media repository, creating a Linux LPAR on a POWER8 machine and installing the .iso image), I ran:             

apt-get install docker.io 

For more on the installation process, see this piece from IBM developerWorks.

While I’m at it, developerWorks has two other related resources — this how-to on using Ubuntu Core with Docker, and this more general Docker write-up:     

“Docker is an open-source container engine and a set of tools to compose, build, ship, and run distributed applications. The Docker container engine provides for the execution of applications composed as Docker images. Docker hub and Docker private registries are services that enable sharing of Docker images. The Docker command-line tool enables developers to build Docker images, to work with Docker hub and private registries, and to instantiate Docker images as containers in a Docker container engine. The relatively small size of Docker images—compared with VM images—and their portability enable developers to move applications seamlessly between test, quality assurance, and production environments, increasing the agility of application development and operations.” 

For me it was pretty interesting to fire up a Docker image of Fedora and run it on my Ubuntu machine. As far as the workload knows, it’s running on Fedora, but under the covers Ubuntu is running on a POWER8 machine. Of course nothing beats hands-on experience, but if you’re familiar with the concept of WPARs, Docker shouldn’t be hard to grasp. 
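
For the curious, that Fedora-on-Ubuntu experiment boils down to a couple of commands. A sketch, assuming a ppc64le-capable Fedora image is available to your Docker installation (the image name is illustrative):

    # pull and start a Fedora container on the Ubuntu POWER8 host
    docker pull fedora
    docker run -it --rm fedora /bin/bash

    # inside the container, the userspace is Fedora...
    cat /etc/os-release
    # ...but the kernel (and the hardware) still belong to the Ubuntu POWER8 host
    uname -a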

Even if you don’t intend to run it in production any time soon, I believe that Docker is worth exploring. Again, downloaded Power images and POWER8 “Dockerized” applications run on Docker just fine. It’s another interesting environment in which to work. So is Docker in your plans?

Fixing RMC Connections to the HMC with 8.8.2 SP1

Edit: Information may still be useful, although I doubt anyone is running this version of HMC code anymore.

Originally posted June 9, 2015 on AIXchange

Recently after upgrading to 8.8.2 SP1, I found my HMC was unable to communicate via RMC to my client LPARs. Though this document helped, when I ran the lspartition -dlpar command, I got this error message:             

Can’t start local session rc=2! 

The document notes that the fix commands must be run as root on the Management Console, and gives these specific commands:


            /usr/sbin/rsct/install/bin/recfgct
            /usr/sbin/rsct/bin/rmcctrl -p 

Of course the problem is you can’t run these commands if you can’t become root. So I contacted IBM Support to get my pesh passwords, and received this email: 

Thank you for contacting IBM.

I understand that HMC is displaying error “Can’t start local session rc=2!” when you run “lspartition -dlpar”.

To resolve this issue, please login to your HMC as hscpe, get root access, and run:

            /usr/sbin/rsct/install/bin/recfgct

Once you have done this, the “lspartition -dlpar” command should show current RMC connection status.

If you do not have the hscpe user on your HMC, you can create it with:

            mkhmcusr -u hscpe -a hmcpe -d “ibm”

If the user already exists, but you do not have the password, you can reset it with:

            chhmcusr -u hscpe -t passwd

You can also reset the root password with:

             chhmcusr -u root -t passwd

Once you login to the HMC as hscpe, run:

            pesh <hmc_serial_number>

You will be prompted for the pesh password, which we have to generate using the HMC serial number; the serial number is listed in the SE field of “lshmc -v” output.

Enter the pesh password.

This will bring you to a prompt where you can run:

            su – root
            Password: <enter_root_password>

You can then run:

            /usr/sbin/rsct/install/bin/recfgct

Wait a few minutes, then run:

            lspartition -dlpar

You should not get the local session error. If you have issues with the RMC connection not being established for the LPARs, please let us know so that we can continue assisting with standard DLPAR troubleshooting procedures. 

In my case this was all I needed to do. Everything started working normally for me. 

Incidentally, since writing this, I came across someone else with the same issue. Hopefully as more folks get this information out here, more of us will have an easier time dealing with this problem. 

Getting Volume Group Info

Edit: Link no longer works.

Originally posted June 2, 2015 on AIXchange

In environments with machines containing many volume groups and filesystems, we want easy ways of manipulating that information. There’s always a need to know which filesystem is in which volume group. If you want to grow the size of a filesystem, you’ll want to know which volume group it’s in so you can check whether that volume group has free space available.
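
As a quick illustration of that kind of lookup (a sketch only; /home, hd1 and rootvg are just sample values), you can map a filesystem to its logical volume, find the owning volume group, and then check that group’s free space:

    # which logical volume backs /home?
    lsfs -c /home | tail -1 | cut -d: -f2      # e.g. /dev/hd1

    # which volume group owns that logical volume?
    lslv hd1 | grep "VOLUME GROUP"

    # how much free space does that volume group have?
    lsvg rootvg | grep -i free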

Brian Smith has a useful post about getting this type of information

Quick tip: List details of all volume groups with lsvg on AIX

The “lsvg” command has a handy “-i” option, which the man page says, “Reads volume group names from standard input.” This brief description doesn’t explain how useful this option can be.

If you run “lsvg” and pipe the output to “lsvg -i” (i.e., “lsvg | lsvg -i”) it will list the volume group information for every volume group on the system. You can also use other lsvg options such as “-l” to list all of the LVs/filesystems from every volume group: “lsvg | lsvg -li.”

This is an excellent way to gather LVM [Logical Volume Manager] information from your system quickly and easily. Another way to use lsvg is to incorporate xargs. As noted in its Wikipedia entry, “in many other UNIX-like systems, arbitrarily long lists of parameters cannot be passed to a command, so xargs breaks the list of arguments into sublists small enough to be acceptable.”  

So, for example, this:

            lsvg -o | xargs -n1 lsvg -l

is similar to this:

            lsvg | lsvg -li

Likewise, this:

            lsvg -o | xargs -n1 lsvg

is similar to this:

            lsvg | lsvg -i

What methods do you use to examine volume group information?

Updating System Firmware

Edit: Some links no longer work.

Originally posted May 26, 2015 on AIXchange

If you’re new to IBM Power Systems, you’re new to upgrading the HMC (see here and here). Furthermore, you’re new to system firmware updates. I’ve previously discussed firmware, and IBM Systems Magazine has other good articles about it (here and here). There’s also this step-by-step guide to updating your system firmware

“IBM Power Systems firmware update, which is often referred to as Change Licensed Internal Code (LIC) procedure, is usually performed on the managed systems from the Hardware Management Console (HMC). Firmware update includes the latest fixes and new features. We can use the Change Licensed Internal Code wizard from the HMC graphical user interface (GUI) to apply updates to the Licensed Internal Code (LIC) on the selected managed system.

We can select multiple managed systems to be updated simultaneously. The wizard also allows us to view the current system information or perform advanced operations. This tutorial provides the step-by-step procedure for the IBM Power Systems firmware update from the HMC command line, and the HMC GUI and is targeted for system administrators.

These step-by-step instructions can prepare the newbie for what needs to be done and how it could be done to stay on the latest firmware level all the time. When you purchase new hardware, the best [practice] is to upgrade all the firmware to the latest level. 

This tutorial provides the following information: 

-Current firmware details 

-Different kinds of code download and update methods 

-Steps to obtain the relevant firmware code updates or releases from the IBM FixCentral website 

-Steps to update the firmware concurrently using DVD media, that is, the fixes that can be deployed on a running system without rebooting partitions or performing an initial program load (IPL) within a specific release 

-Steps to update the firmware disruptively, that is, update requiring the system IPL within a specific release 

-Advanced code update options from the Change Licensed Internal Code wizard 

-Steps to upgrade to recent firmware releases disruptively using the File Transfer Protocol (FTP) method 

-Steps to upgrade the firmware disruptively through the IBM Service website to a required level.” 
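
Before planning an update, it helps to confirm what you’re running today. A quick sketch (the managed system name is a placeholder, and flag details can vary by HMC level):

    # from the HMC command line: show the current licensed internal code (firmware) levels
    lslic -m YOUR_MANAGED_SYSTEM -t sys

    # from an AIX LPAR on that server: display the system firmware level
    lsmcode -c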

Keep in mind that being able to actually get your hands on system firmware requires you to have entitlement for your machines. This means you must make sure you can actually get the code you need, preferably before you actually need it. There’s nothing worse than getting all the necessary approvals for a change window and system downtime, only to have to fail the change and reschedule it because you didn’t have the code you needed. 

How do your machines look? Are your HMC, system firmware and device firmware all at their recommended levels? If you aren’t sure what levels they should be running, don’t forget to check the fix level recommendation tool (FLRT).

When Rebooting, Don’t Forget About the System Profile

Edit: Still good stuff.

Originally posted May 19, 2015 on AIXchange

Recently a customer rebooted some systems that hadn’t been restarted in more than a year. All of the LPARs and the VIO servers were powered off so maintenance could be performed. The customer was able to use live partition mobility to relocate the important LPARs. That left just the dev and test environments. 

Of course, plenty of systems have gone much longer without a reboot, but restarting systems after a year-plus of continuous uptime can be tricky. And in this instance, problems emerged. Someone had done DLPAR operations without then updating the system profile. To make matters worse, the DLPAR operations were related to the VIO server and virtual fibre adapters. When the VIO servers came back up, the system didn’t recognize the dynamically added adapters, and the client LPARs wouldn’t boot.  

Luckily, the customer had hmcscanner output so they could see which adapters were missing based on the information in the client LPAR profiles. However, what should have been a quick restart ended up being a lengthy exercise because the profile information wasn’t in sync with what was actually running.  

How is your systems documentation? When you make a change, do you make sure that the profile has also been updated or saved? 

Along with mksysb, backupios and viosbr, be sure to back up your profile data on the HMC. You never know when someone might have made a change to the running systems and then neglected to back up the profile.
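
On the HMC command line, backing up profile data is a one-liner, and you can also save a partition’s running configuration into its profile. A sketch, with placeholder names for the managed system, partition and backup file (check the exact syntax on your HMC level):

    # back up all partition profile data for a managed system
    bkprofdata -m YOUR_MANAGED_SYSTEM -f myprofiles.bak

    # save a partition's current (running) configuration into a named profile
    mksyscfg -r prof -m YOUR_MANAGED_SYSTEM -o save -p YOUR_LPAR -n current_config --force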

E850 Among the New POWER8 Servers Announced by IBM

Edit: As of the time of the writing the links still worked.

Originally posted May 11, 2015 on AIXchange

On April 28, IBM announced new capabilities for existing POWER8 servers. Today, it’s announcing a new POWER8 server model.

There is a new four-socket 4U server, the Power E850 server, machine type/model 8408-E8E, which will become generally available on June 5, 2015.

The Power E850 will support a maximum of 2 TB of memory, which is a 2X increase over the Power 750, with a statement of direction taking it to 4 TB of memory in the future. The E850 is also redesigned as a 4U server versus the 5U Power 750 and 760 that we had with POWER7+.

The E850 can have two to four processor sockets, with clock speeds up to 3.7 GHz. If you order a system with two processors, you can add the third or fourth processor later with an MES upgrade, or you can populate your server with extra CPU and memory in advance to take advantage of processor and/or memory Capacity Upgrade on Demand, now available with this model.

The processor options for the E850 include up to 48 cores running at 3.02 GHz, up to 40 cores running at 3.35 GHz, or up to 32 cores running at 3.72 GHz. This server will be part of the small software tier.

The E850 will have 11 PCIe Gen3 slots, one of which will be populated by a LAN adapter of your choice; however, keep in mind that some of the slots and memory options may not be available if you do not populate all of the processor sockets. There are two x16 slots available per installed processor, and three additional x8 slots, on each system. If you populate the processors and choose to activate them later then you will have access to all of the available slots and memory.

The E850 is considered an enterprise server, with enhanced reliability features like Active Memory Mirroring for Hypervisor and Capacity Upgrade on Demand, although it will be a customer set up machine and will not be able to be part of a Power Enterprise Pool, like the E870 and E880 server are today.

The following is a list of E850 supported OS levels that I took from a chart from IBM.

If installing AIX LPAR with any I/O configuration:

  • AIX V7.1 TL3 SP5 and APAR IV68444, or later
  • AIX V7.1 TL2 SP7, or later (planned availability September 30, 2015)
  • AIX V6.1 TL9 SP5 and APAR IV68443, or later
  • AIX V6.1 TL8 SP7, or later (planned availability September 30, 2015)

If installing AIX Virtual-I/O-only LPAR:

  • AIX V7.1 TL2 SP1, or later
  • AIX V7.1 TL3 SP1, or later
  • AIX V6.1 TL8 SP1, or later
  • AIX V6.1 TL9 SP1, or later

If installing VIOS:

  • VIOS 2.2.3.51 or later

If installing the Linux operating system:      

-Big Endian

  • Red Hat Enterprise Linux 7.1, or later
  • Red Hat Enterprise Linux 6.6, or later

  • SUSE Linux Enterprise Server 11 Service Pack 4 and later Service Packs

-Little Endian

  • Red Hat Enterprise Linux 7.1, or later
  • SUSE Linux Enterprise Server 12 and later Service Packs
  • Ubuntu 15.04

IBM is also announcing that the Power E880 server fulfills an earlier statement of direction: it can now max out at four nodes instead of two, and there is a new 4 GHz clock speed option with up to 48 cores per node. Also fulfilling a previous statement of direction, the E870 now supports the larger 128 GB memory DIMMs, doubling its maximum memory. Both the E870 and E880 can now attach anywhere from one to four PCIe expansion drawers per node, meaning you could potentially have a four-node E880 with four expansion drawers per node, allowing for up to 192 adapters.

The E880 supports up to 192 cores at 4 GHz, or up to 128 cores at 4.4 GHz, and up to 16 TB of memory with 4 TB per node with 128 GB DIMMs. The E880 has eight PCIe adapter slots per node, along with the capability to have up to 16 PCIe I/O expansion drawers within a four-node system.

The E870 supports up to 8 TB memory, with 4 TB per node with 128 GB DIMMs. It runs up to 80 cores at 4.2 GHz or up to 64 cores at 4.0 GHz. You can order one or two nodes, which are still 5U per node, and you can have up to eight PCIe I/O expansion drawers with a two-node system, which will allow for up to 96 adapters.

With the last announcement, we were limited to either zero or two I/O drawers per E870/E880 node. With this new announcement, we can use just one I/O drawer, and we can even configure half of an I/O drawer if necessary. The E870 supports from a half I/O drawer up to four I/O drawers per node, so a two-node E870 can range from a half drawer up to eight I/O drawers. The E880 likewise supports from a half I/O drawer up to four I/O drawers per node, so a four-node E880 can range from a half drawer up to 16 I/O drawers.

Both the E870 and E880 are part of the Medium software tier.

The I/O drawer can attach to all POWER8 servers:

  • E880 up to 16 I/O drawers
  • E870 up to eight I/O drawers
  • E850 with four sockets populated up to four I/O drawers; with three sockets populated up to three I/O drawers; with two sockets populated up to two I/O drawers
  • S824 with two sockets populated up to two I/O drawers; with one socket populated one I/O drawer
  • S822 with two sockets populated up to one I/O drawer; with one socket populated one-half I/O drawer

The I/O drawers are supported in numerous environments with the exception of OPAL hypervisor and PowerKVM environments.

There are also enhancements to the POWER8 scale-out servers.

There is a new processor option for the S822 and S822L: you can now order a socket with an eight-core 4.15 GHz option, for a total of up to 16 cores running at 4.15 GHz. Keep in mind that the maximum memory with this option is 512 GB, using the 16 GB or 32 GB DIMMs only. Due to the higher cooling requirements, there will also be limitations on which I/O cards can be installed in the CEC versus in an external expansion drawer.

The S814 will allow for up to 1 TB of memory, and the S824 and S824L will allow for up to 2 TB, with the 128 GB DIMM option. This DIMM is too tall to fit in the 2U machines, so do not expect to see it in those servers.

IBM will now support the S824L without a GPU. In the previous announcement, the S824L ran only bare-metal Ubuntu when GPUs were installed. Now you can get PowerVM on the S824L by ordering it without the GPU.

Every scale-out server will be able to use I/O expansion drawers. The maximum slot count on a 2U one-socket server, or a 2U two-socket server with one socket filled, becomes 10 slots. A 2U two-socket server with both sockets filled gives us 18 slots. A 4U one-socket server, or a 4U two-socket server with one socket filled, will have 16 slots.

A 4U two-socket server with two sockets filled will have a maximum of 30 slots.

With the S814 rack model, you can now get a 900W 110V power supply without an RPQ. The S814 tower model has always supported 900W power supplies.

There is a statement of direction that IBM will make water cooling available for the S822 and S822L.

As more Linux distributions add little endian support on Power, you will need to upgrade VIOS to 2.2.3.50, which adds support for little endian Linux LPARs running side by side with big endian Linux LPARs, AIX LPARs and IBM i LPARs on the non-L models.

This newer version of VIOS adds another digit to the numbering scheme, which signifies minipacks. This is a cleaner approach to applying PTFs than using ifixes, which you have to install and uninstall individually.

You can read more about the changes to the strategy at ibm.biz/vios_service.

This makes for quite a portfolio of POWER8 servers and options that are now available from IBM. Contact your favorite IBMer or business partner; I am sure you can find the right machine for your environment.

Handy Tool Provides Adapter Info

Edit: Some links no longer work.

Originally posted May 5, 2015 on AIXchange

As I’ve mentioned, I follow several AIX and IBM Power Systems pros on Twitter.

Benoit Creau (@chmod666 on Twitter) is someone you should follow as well. He’s been working on a new tool called lssea that “lists information and details about PowerVM shared Ethernet adapters.” (Go here for the code, and here, here and here for some nice screenshots that will give you an idea of what to expect when you run the code on your system.) 

I found it very easy to set up. After running oem_setup_env on my VIO server, followed by vi lssea, I clicked on the button marked “raw” on the github page. Then I selected everything and cut and pasted it into the lssea file on my VIO server. 
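
If your VIO server can reach the Internet directly, you could skip the cut and paste and pull the file down in one step. A rough sketch, assuming curl (or wget) is installed and substituting the raw URL you'd copy from that same "raw" button:

curl -o lssea <raw-github-url-for-lssea>

Either way, you end up with the same lssea file in your current directory.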

Then I ran chmod u+x lssea, followed by ./lssea. It immediately showed me output listing the server on which it ran. I was also presented with my ioslevel, the version of the lssea code I’m running, and the date.

running lssea on vio1 | IBM,XXXXXXXXX | ioslevel 2.2.3.3 | 0.1c 030915            

SEA : ent9           

number of adapters   : 2           

vlans                : 1 2 3 4 5           

flags                : THREAD LARGESEND 

Again, I encourage you to check out the screenshots. It's a quick way to determine which real adapters belong to which SEAs as well as find information about control channels, link status, speed, etc. By running it with the -b option, you'll also get buffer information. As an added bonus, if you want to know how Benoit is getting the information that he's displaying with lssea, it's all there because you have access to the source code.

I love tools like this that take output we're all familiar with and provide useful new functionality.

Why You Should Keep a Local Alt Disk Copy

Edit: Some links no longer work.

Originally posted April 28, 2015 on AIXchange

After upgrading an AIX system, a customer found that they needed to back out of the change. They ended up restoring rootvg from a mksysb. 

Although that’s one way to do it, I don’t recommend it. Of course you should have an mksysb around in case of a disaster, but you should also have a local alt disk copy available. This is true for any type of upgrade, but it’s especially critical for both VIO server and regular AIX upgrades. 

In addition, a disk copy can come in handy if someone accidentally messes up rootvg during regular operations. You can switch your bootlist and reboot to a clean copy of your rootvg rather than try to restore from a backup. 
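
The basic flow is short enough to keep in your notes. Here is a minimal sketch for a standalone AIX LPAR (the disk names are illustrative, and on a VIO server you would use the padmin alt_root_vg command instead):

lspv                              # find a free disk of adequate size, say hdisk1
alt_disk_copy -d hdisk1           # clone the running rootvg to hdisk1
lspv                              # hdisk1 should now show altinst_rootvg
bootlist -m normal -o             # alt_disk_copy points the bootlist at the clone by default
bootlist -m normal hdisk0         # set it back if you want to keep booting the original

# If an upgrade on hdisk0 goes badly, point the bootlist at the clone and reboot:
bootlist -m normal hdisk1
shutdown -Fr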

Here are several articles that explain this in detail.

IBM developerWorks: 

“With IBM Power virtualization, the VIOS plays an important role and all running VIOS client LPARs are fully dependent on the Virtual I/O Servers. In such an environment, updating VIOS to a next fix pack level can be challenging, without taking the system down for an extended period of time and incurring an outage. This can be mitigated by creating a copy of the current root volume group (rootvg) on an alternate disk and simultaneously applying fix pack updates first on the cloned rootvg on a new disk.” 

For example, to update VIOS 1.3.0.0 to 1.3.0.0-FP8, clone the 1.3.0.0 system, and then install updates to bring the cloned rootvg to 1.3.0.0-FP8. This updates the system while it is still running. Rebooting from the new rootvg disk brings the level of the running system to 1.3.0.0-FP8. If a problem with the new VIOS level is discovered, changing the bootlist back to the 1.3.0.0 disk and rebooting the server brings the system back to 1.3.0.0. Another scenario would be cloning the rootvg and applying individual fixes, rebooting the system and testing those fixes, and rebooting back to the original rootvg if there is a problem. 

This article explains the step-by-step procedure for applying the next fix pack level on VIOS by creating a copy of the current rootvg on an alternate disk and simultaneously applying fix pack updates. 

IBM Knowledge Center:

“The alt_disk_copy command allows users to copy the current rootvg to an alternate disk and to update the operating system to the next maintenance or technology level, without taking the machine down for an extended period of time and mitigating outage risk. This can be done by creating a copy of the current rootvg on an alternate disk and simultaneously applying software updates. If needed, the bootlist command can be run after the new disk has been booted, and the bootlist can be changed to boot back to the older maintenance or technology level of the operating system. 

Cloning the running rootvg allows the user to create a backup copy of the root volume group. This copy can be used as a back up in case the rootvg failed, or it can be modified by installing additional updates. One scenario might be to clone a 5300-00 system, and then install updates to bring the cloned rootvg to 5300-01. This would update the system while it was still running. Rebooting from the new rootvg would bring the level of the running system to 5300-01. If there was a problem with this level, changing the bootlist back to the 5300-00 disk and rebooting would bring the system back to 5300-00. Other scenarios would include cloning the rootvg and applying individual fixes, rebooting the system and testing those fixes, and rebooting back to the original rootvg if there was a problem. 

At the end of the install, a volume group, altinst_rootvg, is left on the target disks in the varied off state as a place holder. If varied on, it indicates that it owns no logical volumes; however, the volume group does contain logical volumes, but they have been removed from the ODM because their names now conflict with the names of the logical volumes on the running system. Do not vary on the altinst_rootvg volume group; instead, leave the definition there as a placeholder. 

After rebooting from the new alternate disk, the former rootvg volume group shows up in a lspv listing as old_rootvg, and it includes all disks in the original rootvg. This former rootvg volume group is set to not vary-on at reboot, and it should only be removed with the alt_rootvg_op -X old_rootvg or alt_disk_install -X old_rootvg commands. 

If a return to the original rootvg is necessary, the bootlist command is used to change the bootlist to reboot from the original rootvg.” 

IBM developerWorks (again): 

“In 2009, I wrote about using alt_disk_copy… to clone your rootvg disks for ease of back-out when doing AIX upgrades or applications upgrades that resided on the rootvg disks. In that article, I did not cover hardware migrations as this was out of scope. In this article, I discuss how this can be achieved. The man page on alt_disk_copy states (by using the ‘O’ option), “Performs a device reset on the target altinst_rootvg. This causes the alternate disk install to not retain any user-defined device configurations. This flag is useful if the target disk or disks become the rootvg of a different system.” 

In a nutshell, this means that any devices that have had their attributes changed, typically by the system administrator, are reset to the default value(s).” 

AIX Health Check:

“It is very easy to clone your rootvg to another disk, for example for testing purposes. For example: If you wish to install a piece of software, without modifying the current rootvg, you can clone a rootvg disk to a new disk; start your system from that disk and do the installation there. If it succeeds, you can keep using this new rootvg disk; if it doesn’t, you can revert back to the old rootvg disk, like nothing ever happened.” 

And finally, here’s IBM’s “Introduction to Alt_Cloning on AIX 6.1 and 7.1”:

“This guide is intended for those who are new to alternate disk cloning, (or alt_clone for short) and would like to understand the alt_clone process.”

If you would like to learn more about alternate disk cloning, visit the IBM publib website and search on “alt_disk.” 

Do you keep spare LUNs around for your alt_disk copies? If not, why not?

Simplifying PowerVM Management

Edit: Some links no longer work.

Originally posted April 21, 2015 on AIXchange

In December I wrote about a document that covers HMC simplification. Actually, the doc isn’t just about that. It’s also about how IBM is trying to make managing PowerVM easier for customers.

From the document: 

“Managing the IBM PowerVM infrastructure involves configuring its different components, such as the POWER Hypervisor and the Virtual I/O Server(s). Historically, this has required the use of multiple management tools and interfaces, such as the Hardware Management Console (HMC) and the [VIO server] command line interface. 

The PowerVM simplification enhancements were designed to significantly simplify the management of the PowerVM infrastructure, improve the Power Systems management user experience, and reduce the learning ramp for users unfamiliar with the PowerVM technologies. 

This paper provides an overview of the PowerVM simplification enhancements and illustrates how to use the new features available in the HMC to set up and manage the PowerVM infrastructure.” 

Again though, there’s much more to this. How we manage our Power servers will soon undergo some changes. Here’s more from the document: 

“IBM PowerVM is the virtualization solution that enables workload consolidation for AIX, IBM i, and Linux environments on IBM Power Systems. 

The [VIO] Server is a software appliance that works in conjunction with the POWER Hypervisor to enable sharing of physical I/O resources among partitions. Two or more [VIO] Servers are often deployed to provide maximum RAS when provisioning virtual resources to partitions.” 

Next comes an explanation of how IBM is attempting to simplify things. For those of us who’ve worked on the platform for years, it’s pretty straightforward. But if you work with new POWER server users, it’s another matter. A bit of a learning curve will be involved as far as getting it all working and understanding what’s going on under the covers: 

“The PowerVM simplification enhancements encompass architecture changes to the POWER hypervisor and [VIO] Server, new virtualization management features, and new Hardware Management Console (HMC) graphical and programmatic user interfaces to manage the PowerVM infrastructure. The enhancements can be grouped in three main areas:

* Simplified PowerVM infrastructure deployment using templates.           

* Simplified PowerVM management and virtual machine provisioning.           

* Integrated performance and capacity monitoring tool. 

These enhancements are available when managing POWER6, POWER7, and POWER8 Systems using HMC V8.1 or later; except for the performance tool which is available with HMC V8.0 or later. VIO [Server] V2.2.3 or later is recommended for best performance. 

You can access all enhancements by logging in to the HMC Graphical User Interface (GUI) using the Enhanced or Enhanced+ log in option. The performance tool is also available with the Classic log in option. A comparison of the features available with each log in option can be found in the POWER 8 knowledge center.” 

The document offers quite a bit of detail. With the new versions of HMC code that are coming out, we'll be able to do much more from the GUI. There won't be as great a need to configure machines from the VIO command line. Future posts will cover my impressions of the new HMC code, but for now, here's more from the document: 

“Configuring and managing the PowerVM infrastructure on Power Systems can be accomplished performing the following tasks: 

1. Capturing and editing templates to create custom PowerVM configurations that can be deployed on one or more systems.

2. Deploying a system template to initialize the PowerVM infrastructure.

3. Creating a partition from template to get ready to deploy workloads.

4. Managing PowerVM to modify the virtual network and virtual storage configuration as needed to meet workload demands.

5. Managing partitions to dynamically modify their virtual storage and network resources as needed.

6. Monitoring performance and capacity information to understand resource utilization and identify potential problems.” 

Although I still prefer the command line, I can understand the desire to simplify PowerVM management. I know that for non-UNIX users and those with an IBM i background, things like command completion and shell history can be hard to understand. Rather than have to learn all of this, these folks now have the option to simply manage their machines via point and click: 

“You can view and modify all the [VIO] Server resources and configuration settings by selecting a [VIO Server] in the [VIO Server] overview and accessing the Manage task. The Manage task allows the user to change the processor, memory, physical I/O, and hardware virtualized I/O resources, e.g. logical Host Channel Ethernet Adapters or logical SR-IOV ports, configured to the [VIO Server], either dynamically, that is, while the [VIO Server] is powered on, or when the [VIO Server] is shutdown. 

You can view and modify all partition resources by selecting a partition and accessing the Manage Partition task. You can dynamically change virtual network, virtual storage, and hardware virtualized I/O resources configured to the partition. 

You can access the performance dashboard for a system by selecting a system and choosing Performance. The performance dashboard provides quick visualization of system and partition processor, memory, and I/O resources allocation and utilization… .” 

The PowerVM simplification enhancements available through the [HMC] significantly simplify virtualization management tasks on IBM Power Systems and support a repeatable workload deployment process. 

As with anything in technology, I like to consider how far things have come. It’s pretty incredible to look back on what we could do with early versions of VIO server and HMC code and compare it to the things we can do today. At the same time, as much as I relish looking back, I also look forward to what’s ahead. Where PowerVM is concerned, I’m excited about the future.

Setting Up LPAR Error Notification

Edit: Are you monitoring errors?

Originally posted April 14, 2015 on AIXchange

Your shop has no budget for monitoring software, but you still want to be notified when LPAR errors appear in the AIX error log. You have a few options. 

You could write scripts and periodically run them out of cron. You could set up a master workstation and use it to ssh into each of the machines you want to monitor and run errpt. Or you could set up your machines to send you email notifications of new errors. To do this, you could hard code an email address — either your own, a group address or some generic address (e.g., one that’s monitored by operations or the on-call person) — or you could route the emails to root on the server and set up a .forward file to distribute them to all the addresses you choose to designate. This nice how-to document has the details: 

“Having the pleasure of working across many client accounts, it’s funny to see some of the convoluted scripts people have written to receive alerts from the AIX error log daemon. Early in my AIX career, I used to do the exact same thing, and it involved a whole bunch of SSH keys, some text manipulation, crontab, and sendmail. Wouldn’t it be nicer if AIX had some way of doing all of this for us? Well, you know I wouldn’t ask the question if the answer wasn’t yes! 

Step 1
Create a temporary text file (e.g. /tmp/errnotify) with the following text:

errnotify: 

en_name = "mail_all_errlog" 

en_persistenceflg = 1 

en_method = "/usr/bin/errpt -a -l $1 | mail -s \"errpt $9 on `hostname`\" user@mail.com" 

Step 2
Add the new entry into the ODM.

# odmadd /tmp/errnotify 

Step 3
Test that it’s working by adding an entry into the error log.

# errlogger 'This is a test entry' 

If required, you can delete the ODM entry with the following command:

# odmdelete -q 'en_name=mail_all_errlog' -o errnotify

0518-307 odmdelete: 1 objects deleted. 

To send notifications to multiple addresses, you can do something like ops@company.com,unix@company.com. To update your email address, be sure to do the odmdelete first; if you just rerun the odmadd, it will create multiple entries in the ODM. To see the entries on your system, use:                

# odmget -q 'en_name=mail_all_errlog' errnotify 

One caveat: I know of one environment that processed so much email and logged so many SAN errors that it actually impacted system performance. It would be nice if there were a way to limit the rate at which error messages are sent out when a flood of errors is generated for some reason (a rough sketch of one way to do that appears at the end of this post). This whole process also assumes you have sendmail working. For those instructions, check out this IBM developerWorks article: 

To start the Sendmail daemon automatically on a reboot, uncomment the following line in the /etc/rc.tcpip file:

# vi /etc/rc.tcpip

start /usr/lib/sendmail "$src_running" "-bd -q${qpi}"

Execute the following command to display the status of the Sendmail daemon:

# lssrc -s sendmail

To stop Sendmail, use stopsrc:

# stopsrc -s sendmail

The Sendmail configuration file is located in the /etc/mail/sendmail.cf file, and the Sendmail mail alias file is located in /etc/mail/aliases.

If you add an alias to the /etc/mail/aliases file, remember to rebuild the aliases database and run the sendmail command with the -bi flag or the /usr/sbin/newaliases command. This forces the Sendmail daemon to re-read the aliases file.

# sendmail -bi

To add a mail relay server (smart host) to the Sendmail configuration file, edit the /etc/mail/sendmail.cf file, modify the DS line, and refresh the daemon:

# vi /etc/mail/sendmail.cf

DSsmtpgateway.xyz.com.au

# refresh -s sendmail 

You can use this same method to monitor your VIO servers. 
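
As for the rate-limiting caveat I mentioned above, one approach is to point en_method at a small wrapper script instead of calling mail directly. This is only a sketch of the idea, not something from the article; the path, address and five-minute window are all arbitrary:

#!/usr/bin/ksh
# /usr/local/bin/errnotify_mail.sh (hypothetical) -- call it from en_method as:
#   en_method = "/usr/local/bin/errnotify_mail.sh $1 $9"
# $1 is the error log sequence number, $9 is the error label.
SEQ=$1
LABEL=$2
STAMP=/tmp/.last_errnotify_mail         # holds the time we last sent mail
WINDOW=300                              # minimum seconds between emails

NOW=$(perl -e 'print time')             # epoch seconds; perl ships with AIX
LAST=$(cat $STAMP 2>/dev/null)
[ -z "$LAST" ] && LAST=0

if [ $(( NOW - LAST )) -ge $WINDOW ]; then
    /usr/bin/errpt -a -l $SEQ | mail -s "errpt $LABEL on $(hostname)" user@mail.com
    echo $NOW > $STAMP
fi

The obvious tradeoff is that you only see the first error in a burst, so a daily errpt summary out of cron makes a sensible backstop.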

How are you notified of LPAR errors?

The more Command and vi

Edit: Did you know you could do this?

Originally posted April 7, 2015 on AIXchange

It’s easy to overlook the simple things. For instance, did you know that vi can be invoked from within the more command? 

From “man more”:            

The more command uses the following subcommands:

             h            Displays a help screen that describes the more subcommands.

             v            Starts the vi editor, editing the current file in the current line.

 To try this out, run:

             more /etc/hosts

Then from inside your more session, type v and you will go into vi. At that point you're actually inside vi in editing mode, positioned at the line you were viewing.

Once you modify the file and save your changes, exit out to return to your more session and verify that your changes were made.

Be sure to look at some of the other options available in more, such as how to get to the very end or very beginning of a file, or how to skip ahead a particular number of lines. 

While we're on the subject, here's a reminder for newer AIX administrators: "set -o vi" gives you easy access to your shell history, along with command completion and other capabilities. 
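
A few of the keystrokes I reach for most once vi mode is on (from memory, so verify against the ksh man page on your system):

set -o vi            # turn on vi-style editing in ksh
# Press ESC first, then:
#   k / j            step backward / forward through your command history
#   /string          search history for a command containing "string"
#   \                attempt filename completion on the word under the cursor

Put set -o vi in your .profile or .kshrc so it is there on every login.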

I could go on about the usefulness of vi, but why not get a vi cheat sheet and see for yourself? I’ll list a couple (here and here), but there are many more online. Just search on “vi cheat sheet” and look at the image results.

More Terrifying Tales of IT

Edit: We see these stories these days when ransomware takes out critical systems.

Originally posted March 31, 2015 on AIXchange

I enjoy reading IT-related horror stories, especially those that hit close to home. For me, the best thing about these stories is figuring out what went wrong and then incorporating those lessons into my own environments. Here are a couple of good reads that I want to share.

First, from Network World:

        “Our response to the outage was professional, but ad-hoc, and the minutes trying to resolve the problem slipped into hours. We didn’t have a plan for responding to this type of incident, and, as luck would have it, our one and only network guru was away on leave. In the end, we needed vendor experts to identify the cause and recover the situation.
        Risk 1: The greater the complexity of failover, the greater the risk of failure.
        Remedy 1: Make the network no more complex than it needs to be.
        Risk 2: The greater the reliability, the greater the risk of not having operational procedures in place to respond to a crisis.
        Remedy 2: Plan, document and test.
        Risk 3: The greater the reliability, the greater the risk of not having people that can fix a problem.
        Remedy 3: Get the right people in-house or outsource it.”

I’ve always said that having a test system is invaluable, but simply having the system available to you isn’t enough. You must also make the time to use it, play with it, blow it up. And you absolutely cannot allow your test box to slowly morph into a production server.

This ComputerWorld article tells an even scarier tale of a hospital that was forced to go back to all paper when its network crashed. Though this incident occurred back in 2002, I believe it’s still relevant reading. Technology today is more reliable than ever, but troubleshooting is a skill we’ll always need.

        “Over four days, Halamka’s network crashed repeatedly, forcing the hospital to revert to the paper patient-records system that it had abandoned years ago. Lab reports that doctors normally had in hand within 45 minutes took as long as five hours to process. The emergency department diverted traffic for hours during the course of two days. Ultimately, the hospital’s network would have to be completely overhauled.
        First, the CAP team wanted an instant network audit to locate CareGroup’s spanning tree loop. The team needed to examine 25,000 ports on the network. Normally, this is done by querying the ports. But the network was so listless, queries wouldn’t go through.
        As a workaround, they decided to dial in to the core switches by modem. All hands went searching for modems, and they found some old US Robotics 28.8Kbps models buried in a closet. Like musty yearbooks pulled from an attic, they blew the dust off them. They ran them to the core switches around Boston’s Longwood medical area and plugged them in. CAP was in business.
        In time, the chaos gave way to a loosely defined routine, which was slower than normal and far more harried. The pre-IT generation, Sands says, adapted quickly. For the IT generation, himself included, it was an unnerving transition. He was reminded of a short story by the Victorian author E.M. Forster, “The Machine Stops,” about a world that depends upon an uber-computer to sustain human life. Eventually, those who designed the computer die and no one is left who knows how it works.
        He found himself dealing with logistics that had never occurred to him: Where do we get beds for a 100-person crisis team? How do we feed everyone?

        Lesson 1: Treat the network as a utility at your own peril.
        Actions taken:
        1. Retire legacy network gear faster and create overall life cycle management for networking gear.
        2. Demand review and testing of network changes before implementing.
        3. Document all changes, including keeping up-to-date physical and logical network diagrams.
        4. Make network changes only between 2 a.m. and 5 a.m. on weekends.

        Lesson 2: A disaster plan never addresses all the details of a disaster.
        Actions taken:
        1. Plan team logistics such as eating and sleeping arrangements as well as shift assignments.
        2. Communicate realistically—even well-intentioned optimism can lead to frustration in a crisis.
        3. Prepare baseline, “if all else fails” backup, such as modems to query a network and a paper plan.
        4. Focus disaster plans on the network, not just on the integrity of data.”

Anyone who’s spent even a few years in our profession has at least one good horror story. What’s yours? Please share it in comments.

Maximizing IOPS

Edit: Some links no longer work.

Originally posted March 24, 2015 on AIXchange

Recently I listened to a discussion of the differences in input/output operations per second (IOPS) in various workload scenarios. People talked about heavy reads. They talked about heavy writes. They debated whether it was better to use RAID5, RAID6 or RAID10. Things got a little heated.

I came away thinking that I should cover this topic and share some resources with you. For instance, this article provides basic information about physical disks, but also makes some interesting points:

“Published IOPS calculations aren’t the end-all be-all of storage characteristics. Vendors often measure IOPS under only the best conditions, so it’s up to you to verify the information and make sure the solution meets the needs of your environment.

IOPS calculations vary wildly based on the kind of workload being handled. In general, there are three performance categories related to IOPS: random performance, sequential performance, and a combination of the two, which is measured when you assess random and sequential performance at the same time.

Every disk in your storage system has a maximum theoretical IOPS value that is based on a formula. Disk performance — and IOPS — is based on three key factors:

    Rotational speed
    Average latency
    Average seek time

Perhaps the most important IOPS calculation component to understand lies in the realm of the write penalty associated with a number of RAID configurations. With the exception of RAID 0, which is simply an array of disks strung together to create a larger storage pool, RAID configurations rely on the fact that write operations actually result in multiple writes to the array. This characteristic is why different RAID configurations are suitable for different tasks.

For example, for each random write request, RAID 5 requires many disk operations, which has a significant impact on raw IOPS calculations. For general purposes, accept that RAID 5 writes require 4 IOPS per write operation. RAID 6’s higher protection double fault tolerance is even worse in this regard, resulting in an “IO penalty” of 6 operations; in other words, plan on 6 IOPS for each random write operation. For read operations under RAID 5 and RAID 6, an IOPS is an IOPS; there is no negative performance or IOPS impact with read operations. Also, be aware that RAID 1 imposes a 2 to 1 IO penalty.”
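
Those write penalties plug straight into the usual back-of-the-envelope formula: usable front-end IOPS = raw back-end IOPS / (read fraction + write fraction x penalty). Here is a quick sketch you could run in ksh with awk; the disk count, per-disk IOPS and workload mix are made-up numbers, purely for illustration:

disks=8; per_disk=175             # e.g., eight 15K drives at roughly 175 IOPS each
read_pct=70; penalty=4            # 70/30 read/write mix on RAID 5
raw=$(( disks * per_disk ))
echo "$raw $read_pct $penalty" | awk '{
    r = $2 / 100; w = 1 - r
    printf "raw back-end IOPS: %d  usable front-end IOPS: %.0f\n", $1, $1 / (r + w * $3)
}'

With those numbers you get 1,400 raw IOPS but only about 737 usable front-end IOPS; change the penalty to 6 for RAID 6 and it drops to 560.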

Again, that article is focused on physical disks. But I'm also seeing more and more solid state devices (SSDs) being deployed. These charts compare spinning disks to SSDs, and they're eye-opening. While a 15K SAS drive might see 210 IOPS, an individual consumer-grade SSD might see 5,000 to 20,000 IOPS. Disk subsystems like the IBM FlashSystem 840 can show 1.1 million IOPS on 100 percent random 4K reads, while a mixed read/write workload might deliver 775,000 IOPS.

Here's an interesting tool that lets you configure environments for SSD and physical disk and compare their performance. By moving other variables around, you can model hard drive capacity and estimate workload read/write percentages and the number of drives being used.

What methods do you use when configuring your disk subsystem? Is SSD being deployed in your environment? What RAID levels are you targeting?

Readers Discuss VIOS Installs

Edit: With flash USB drives this is even easier now.

Originally posted March 17, 2015 on AIXchange

Perhaps I wasn’t clear when I explained why NIM is my VIO server installation option of choice. In any event, this reader’s response got me thinking further about the topic:

“I prefer (to) install from the virtual media repository. It is faster with no network issue. Because when you start install always network is not ready yet. And when you are updating if your NIM is not up to date (lpp_source, spot, mksysb, …) you take much time. With Virtual media repository you need just iso image to start install.”

To me, using the virtual media repository in the original scenario (installing a VIO server) presents a chicken-and-egg dilemma. If I could use a virtual media library, the VIO server would already be installed. But because one VIO server can’t be a client of another VIO server, I can’t use a virtual media repository to build a VIO server.

I absolutely agree that a virtual media repository is a great way to go when installing client LPARs. It just doesn’t help with the initial VIO server install.

Another reader shared a different scenario: He was installing IBM i and VIOS. Because he didn't know how to build a NIM server (and he didn't have access to AIX media anyway), installing from one was out of the question. He tried using the HMC to install a second VIO server, but it wasn't working. It would start to boot, but then the install would blow up. He had IBM Support specialists look at the logs, but they couldn't find the problem.

Eventually, he arrived at an inelegant solution. He had a split backplane on his server, and was able to install VIOS1 from physical media with no issues. The DVD was attached to controller 1, with no way to swing it over to controller 2. So he added controller 2 from VIOS2 to his VIOS1 profile, and then restarted the VIOS1 profile. When he booted from the DVD, he opted to install VIOS to a disk that was on controller 2. Once the installation completed, he shut everything down, put controller 2 back into the VIOS2 profile, and booted both VIO servers. VIOS2 would have a few extra logical device definitions that were no longer available, but otherwise, everything worked.

While I don’t expect to run into these issues myself, it is nice to know that different installation options exist. You never know when someone else’s solution might get you out of your own jam.

It’s Time to Snuff Out Commodity Servers

Edit: These days it sounds like people are trying to outlaw them. Some links no longer work.

Originally posted March 10, 2015 on AIXchange

During a recent lunch with customers, the topic of smoking came up. Some were talking about smoking hookahs, some were talking about cigars, and some were talking about cigarettes. One of the guys had recently quit smoking. He credited the Internet, which pointed him to information about e-cigarettes.

He said e-cigarettes helped him curtail his nicotine intake, adding that the flavored e-liquids with a more fruity taste helped him disassociate smoking from the flavor of tobacco. Then, eventually, he just stopped smoking entirely.

Someone said I should write about this, and wondered how I could possibly come up with an analogy that married smoking cessation with some technological topic. It was meant as a joke, but once I gave it some thought, I did make a connection: x86 servers. In the tech world, running Linux on commodity x86 servers is a bad habit that many of us want to break. However, we’ve been doing it for years, and we just can’t seem to stop. Sure, we’ve seen the ads telling us how our lives will be better once we quit, but some of us still can’t find a method that really works.

So has the analogy broken down for you yet? Yeah, me too. Admittedly, better analogies can be made in this case. For instance, when I think about what typically runs on Power systems, I usually imagine huge workloads that require massive amounts of uptime. These critical servers are the backbone of our businesses. Others have compared running Power systems to construction vehicles like giant earth-moving machines. Along those lines, I’ve seen IBM presentations that compared x86 servers and Power systems to bicycles and automobiles.

So would you try to move tons of dirt with a small pickup truck and a shovel? Would you put bicycle tires on a car? Then why do we insist on running the smaller and less critical workloads on slower, less powerful, less robust commodity hardware? Why aren’t we taking advantage of the machines we already have in our data centers, the same machines we trust with our most critical workloads?

We should run Power IFLs, which would enable us to fire up dark cores and memory on our larger machines at an attractive price. We should run Linux on Power with POWER8 scale-out servers with PowerVM or PowerKVM. Using these options, we could wean ourselves off commodity servers, and ultimately dispense with them entirely.

We should be educating ourselves as to why Power is the best choice. Google, Rackspace and others in the OpenPower Foundation are working on data center development around POWER8. Why aren’t you? Didn’t you see this report?

“Newly disclosed scores show Power8 beating Intel’s most powerful server processor, the 18-core Xeon E5-2699v3 (Haswell-EP), on important benchmark tests. Both processors deliver outstanding performance on the SPEC CPU benchmarks, but IBM’s huge advantages in multithreading and memory bandwidth favor Power8 when running larger test suites that more closely reflect real-world enterprise applications.

Overall, the results show that IBM offers a viable high-end alternative to Intel’s market-leading products. Equally important to Big Blue, Power8’s performance is energizing the OpenPower Foundation, an IBM-led alliance that rallies other companies to create a larger hardware and software ecosystem around the processor. IBM is offering Power8 chips to system builders in the merchant semiconductor market and is even licensing the architecture to other processor vendors. So far, the alliance has more than 80 members, including software, system, and semiconductor vendors.

Power8 is IBM’s most powerful microprocessor yet. On the merchant market, it’s available with 8, 10, or 12 CPU cores at maximum clock frequencies of 3.126GHz to 3.758GHz. Compared with its Power7+ predecessor, which is not a merchant product, Power8 offers twice the threads and L2 cache per core, up to 20% more L3 cache, a new L4 cache, up to four times the peak DRAM bandwidth, and twice the per-core SPEC CPU throughput.”

Whether it’s a force of habit or a lack of information, many customers continue to rely upon commodity hardware. Maybe it’s time to take a closer look at what you can do with POWER8.

The Laptop’s Future, Revisited

Edit: I still use my beefy laptop most of the time.

Originally posted March 3, 2015 on AIXchange

A reader had an interesting response to my recent post about the end(?) of desktop and laptop computers. With his permission, I’ll share some of our email exchange:

Hi Rob — Greetings from another dinosaur. For some reasons the comments in your article do not work for me. I think you’ve exposed just the tip of the iceberg. Here are a few more reasons why it is too early to declare Desktops/Laptops dead:

1. Battery, battery and battery again.
Smartphones still demand to be charged every day. Watching movies is fine but using the radio for voice drains the battery in just few hours. Streamed data transfers make things worse. Using the phone for an hour here and an hour there is fine but using it eight hours a day ultimately ties one to the power cord.

One could also add that removing the DVD allows you to slam a second battery in, and to replace the main battery with a spare. Using two 9-cells and an UltraBay battery, my TP was able to survive a transatlantic and a couple of connection flights.

 2. Long running jobs
Phones/tablets are good for those on the move, but running a long job ties one to a DB server. Yes, the report can run in a VM on a “remote desktop” server, and the tablet can be used as a RDP client. Virtual or not, the desktop is still needed. The phone in such cases is nothing more than a “thin client”; i.e. a dumb keyboard-and-screen device. And a rather cumbersome one, to be honest.

3. Security
Irritated by the size of a laptop an employee puts some confidential data on his/her phone. The data is not encrypted… to save energy. Every security conscious person knows the rest.

Here’s my reply:

I guess a phone guy could argue that he can plug into an external battery to recharge his phone as well, or, with the right model, just swap the phone battery out. However, I still don’t think I’m getting more with a phone compared to what I already have with a laptop. Just comparing the memory, disk space, screen size and processor, I don’t understand why I’d want to go backwards.

I’ve always assumed that I’ll be able to buy a laptop for years to come, but I guess there are those who believe they’re no longer needed. Maybe for a less-demanding user, a phone is perfectly fine. I still want to know where all these displays are that are just waiting for us to hook our phones up to them, especially out on raised floors, etc. I haven’t tried to connect to a serial port on a machine using my phone, but I know it works great on my laptop.

Later in our discussion, the reader adds:

A phone user can have a folding stand for his phone, a folding keyboard, an extra battery, a micro-USB to USB cable, a USB card reader, etc. Have I heard anyone mentioning a folding display? The light-as-a-feather turns out to be cable spaghetti. The “phone and phone only” approach works for those who seldom need anything else. Yeah, one can use any display available. Will he present to the whole crowd around the report for the next shareholders meeting he is currently working on?

The TrackPoint was invented to save the split-second movements between the keyboard and the mouse. Why not replace that with zoom out/scroll/zoom in? I doubt it would be faster.

What about multitasking? Throughout the course of the last 20 years I was always faster than the computer. On my desktop I often open tens of tabs, reading one while the others are loading. Closing the just read one, and instantly reading the next. On my laptop I am limited by the screen size. That is just not possible on a phone, or it is painfully slow. So I am much more productive.

There are many applications the phones and the tablets are well suited for. There are quite a few they are not. The “one size fits all” dream is as elusive as ever.

Incidentally, eWeek and Business2Community both have recent articles opining that laptops and desktops will remain with us for the foreseeable future. So I guess not wanting to run everything from my phone doesn’t, at this point, make me a dinosaur.

A Fun Look Back at Technology

Edit: I still have a landline and a Model M.

Originally posted February 24, 2015 on AIXchange

I like watching many of the old movies aired on TCM (Turner Classic Movies). In addition to enjoying the stories being told, I just love seeing the clothing and the buildings and the landscapes of bygone eras. While I understand that much of what I'm viewing is actually black and white footage of Hollywood sets and interiors — as opposed to the "real world" that existed back then — I still find it fascinating to hear how people talked and see how they went about their daily lives.

I know I'm far from alone in this belief, but as a fan of old films, I'm convinced that there aren't many new ideas coming out of modern Hollywood. A lot of premises and situations that hit the big screen 75 or more years ago are still being recycled today.

With this in mind, I want to tell you about some old films that offer a glimpse into the world of technology. A number of great videos are available on YouTube, and as much as our machines have changed over generations, a lot of the information being presented remains relevant.

For instance, check out this 1937 video (produced by Chevrolet) that explains the technology behind an automobile’s differential. Tempting as it may be to dismiss this film based on that dorky intro music alone, there’s valuable information here. The filmmakers do a great job of explaining how engineers were able to solve the problems associated with sending power to two rear wheels. Honestly, I never realized that automobile engines could only deliver power to one wheel prior to this innovation.

In 1965, IBM UK came out with this production that I believe still holds up nicely. Entitled “Man and Computer,” this video reduces a computer to five basic functions: input, memory, calculation, output and a control unit. It’s also fun to see the symbols they used to represent each of these terms — among them, adding machines and typewriters. Everything covered here — how computers use instructions, how those instructions become a program, basic on/off electrical states — is explained simply enough for the non-technical user to understand. (And keep in mind that a half century ago, almost no one used computers.)

This video was so good it had me thinking fondly about the days of punch cards. Luckily for me, I quickly discovered this video about punch cards.

As I said, technology has obviously and immeasurably changed since these old films were produced. Nonetheless, I think even in the computer world, some of our early innovations still have value. Consider computing’s timeline: Once, timeshare machines predominated. Eventually, we got personal computers. When we wanted our disparate computers to be able to communicate with one another, client/server emerged. Then came the public Internet, virtual desktops and the cloud. Oh and the mainframe: Wasn’t that supposed to die 20 years ago? Wasn’t proprietary UNIX supposed to go with it? Yet here we are in 2015 with powerful new mainframes and POWER8 processors.

Of course I love new technologies. New servers running POWER8 are so much more powerful than their predecessors. Naturally, I want to see progress. At the same time, I still use a landline phone. Landlines remain the best option for long-running conference calls. I never worry about poor cellular connections or drained batteries. In addition, I prefer the old-school ThinkPad keyboards to any current keyboard design. And obviously, I still cling to my model M keyboard.

Embracing what’s new is fine, but just because something is brand new, that doesn’t mean we should throw out everything that came before it.

Connecting Your HMC to IBM Support, Revisited

Edit: Still good information.

Originally posted February 17, 2015 on AIXchange

In this August 2014 post I discussed how to connect your HMC to IBM Support.

That post includes a link to a .pdf document that outlines the different connectivity options. However, this IBM technote seems easier to work with:

“The following is a list of ports used by the HMC. The “Inbound application” column identifies ports where the HMC acts as a server that remote client applications connect to. Examples of remote client applications include the browser based remote access and remote 5250 console. Ports used by remote clients need to be enabled in the HMC firewall. They must also be enabled in any firewall that is between a remote client and HMC.

The “Outbound application” column identifies ports where the HMC acts as a client, initiating communications to the port on a remote server. Functions are further classified as Intranet or Internet. Intranet functions are typically limited to communications between the HMC and another HMC, partition or server inside the network. Internet functions require access to the Internet, directly or, in some cases, via a proxy. Because UDP is a directionless protocol, the HMC firewall must be enabled for UDP ports even though the communications may be initiated from the HMC. “Outbound” application ports must be enabled in external firewalls for the function to work. …”

The document then provides a lengthy list of commonly used ports. It also lists some typical configurations:

  • Firewall between the HMC and remote users: 443, 9960, 12443, 2300, 2301, 22
  • Firewall between HMC and other HMCs/partitions: Bi-directional 657 tcp/udp, 9900 udp, 9920
  • Firewall between the HMC and the Internet: Internet VPN 500/4500 udp, outbound 80, 443; outbound FTP
  • Firewall between the HMC and the Managed Server: TCP 443, 30000, 30001
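
If you want to sanity-check one of these paths before opening a ticket, a quick loop from a host on the relevant network segment will tell you whether the TCP ports answer. This is a rough sketch only; substitute your own HMC name, and note that nc may not be installed on a stock AIX box, though it is common on Linux jump hosts:

HMC=hmc01.example.com                  # hypothetical HMC hostname
for port in 22 443 9960 12443; do
    if nc -z -w 5 $HMC $port >/dev/null 2>&1; then
        echo "port $port reachable"
    else
        echo "port $port blocked or closed"
    fi
done

UDP ports such as 657 will not show up properly with a simple TCP probe, so treat this as a first pass rather than a substitute for checking the firewall rules themselves.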

If you’re looking for more information on setting up your HMC to call home, here’s another good how-to document that discusses setting up AIX or Linux to use a management console to connect to IBM service and support:

“This procedure contains the complete list of steps that are needed to set up connectivity to service and support. Some of these steps might already have been completed during the initial server setup. If so, you can use this procedure to verify that the steps were completed correctly.

In this information, an Internet connection is defined as access to the Internet from a logical partition, server, or a management console by direct or indirect access. Indirect means that you are behind a network address translation (NAT) firewall. Direct means that you have a globally routable address without an intervening firewall, which would block the ports that are needed for communication to service and support.”

On an unrelated note, if you have an issue with VIO server tasks on the HMC, this document may be helpful:

Error “3003c 2610-366” after apply of Service Pack 1
Technote (troubleshooting)

Problem(Abstract)
The apply of V8R8.1.0 Service Pack 1 or V7R9.1.0 Service Pack 1 may cause some VIOS related tasks to fail. Impacted HMC tasks include Manage PowerVM and Manage partitions task in the new V8R8 “enhanced GUI” as well as the Performance and Capacity Monitor (PCM). External applications using the HMC REST API such as IBM PowerVC are also impacted. The error text will typically include the error message “3003c 2610-366 The action array contains an undefined action name at index 0: VioService.”

Contact IBM support for the circumvention until a fix is available.

Another POWER8 Development Option

Edit: There are easier ways to get access to hardware.

Originally posted February 10, 2015 on AIXchange

If you’re looking to develop software for POWER8 systems but don’t have access to POWER8 hardware, there are options like the virtual loaner program or some kind of test system. You should also be aware of the IBM POWER8 Functional Simulator:

The IBM POWER8 Functional Simulator is a simulation environment developed by IBM. It is designed to provide enough POWER8 processor complex functionality to allow the entire software stack to execute, including loading, booting and running a Fedora 20 BE (Big Endian) kernel image or a Debian LE (Little Endian) kernel image. The intent for this tool is to educate, enable new application development, and to facilitate porting of existing Linux applications to the POWER8 architecture. While the IBM POWER8 Functional Simulator serves as a full instruction set simulator for the POWER8 processor, it may not model all aspects of the IBM Power Systems POWER8 hardware and thus may not exactly reflect the behavior of the POWER8 hardware.

Features
  • POWER8 hardware reference model
  • Models complex SMP effects
  • Architectural modeled areas:
    - Functional units (Load/Store, FXU, FPU, etc.)
    - Pipeline
    - Exceptions and Interrupt handling
    - Address translation
    - Memory and basic cache modeling (SLBs, TLBs, ERATs)
  • Linux and Hypervisor development and debug platform
  • Boots Fedora 20 (BE) and Debian (LE) kernel images
  • TCL command-line interface provides:
    - Custom user initialization scripts
    - Processor state control for debug: Step, Run, Cycle run-to, Stop, etc.
    - Register and Memory R/W interaction

Supported x86_64 host operating systems for running the IBM POWER8 Functional Simulator
  • Fedora 20
  • Red Hat Enterprise Linux 7.0
  • Suse 12
  • Ubuntu 14.10

Supported 64-bit Big Endian Linux distributions for booting the IBM POWER8 Functional Simulator
  • Fedora 20
  • Other distributions may function, however, no testing has been performed

For detailed information, check out the user guide and the command reference guide. I’ll highlight the user guide descriptions of the Simulator’s Linux and Standalone modes:

Linux Mode
In Linux mode, after the simulator is configured and loaded, the simulator boots the Linux operating system on the simulated system. At runtime, the operating system is simulated along with the running programs. The simulated operating system takes care of all the system calls, just as it would in a nonsimulation (real) environment.

Standalone Mode
In standalone mode, the application is loaded without an operating system. Standalone applications are usermode applications that are normally run on an operating system. On a real system, these applications rely on the operating system to perform certain tasks, including loading the program, address translation, and system-call support. In standalone mode, the simulator provides some of this support, allowing applications to run without having to first boot an operating system on the simulator.

Why not download the code and give it a try?

How Do You Handle Host Names?

Edit: Still worth putting thought into.

Originally posted February 3, 2015 on AIXchange

About a month ago this discussion hit the AIX mailing list. I’m posting the thread here to get your feedback.

First, the original question:

“Date:  Tue, 6 Jan 2015 15:41:01 -0600
From:  Russell Adams
Subject: Hostname as short name or FQDN?

Here’s a great question for the brain trust:

Which is actually the correct best practice for host names? The host name as a fully qualified domain name, or a short name?

Supporting documentation required! Thanks.”

And a reply:

“Date:  Tue, 6 Jan 2015 22:25:52 +0000
From:   Davignon, Edward
Subject: Re: Hostname as short name or FQDN?

Russell,

That is a really good question!

According to “man mktcpip”:
    -h HostName
        Sets the name of the host. If using a domain naming system, the domain and any subdomains must be specified. The following is the standard format for setting the host name:
        hostname
        The following is the standard format for setting the host name in a domain naming system:
        hostname.subdomain.subdomain.rootdomain

That being said, many sites use only the short name for the hostname.

Also keep in mind that “/etc/rc.net” sets the node name (uname -n) to the short name and the hostid based on the hostname (actually its IP address as resolved). “/etc/rc.net” sets the hostname based on the name for inet0 in the ODM.

It also brings up the questions of how to best configure names and aliases in “/etc/hosts” and how best to configure these in DNS or other naming services, so they match gethostbyaddr. Some related files are “/etc/resolv.conf”, “/etc/irs.conf”, and “/etc/netsvc.conf”. It has long plagued the community when gethostbyaddr (or gethostbyname) return different responses on the database server and the application server, because /etc/hosts does not match DNS.

I ran into a problem with this once with an early version of the DataGuard installer from Oracle. It got confused, since it did not have the FQDN in the hostname. The Oracle install guide clearly stated that the FQDN was required. This was the only time I have seen this matter.
Since we often cannot control data returned by naming services, it may be better to make sure gethostbyaddr or gethostbyname (i.e. the “host” command) return the same thing on all of the servers that use the hostname of the server you are configuring.

From “man uname”:  “-n” Displays the name of the node. This may be a name the system is known by to a UUCP communications network.”

Now, Russell’s reply to Edward:

“Date:   Tue, 6 Jan 2015 16:32:29 -0600
From:   Russell Adams
Subject: Re: Hostname as short name or FQDN?

A short hostname includes an empty subdomain.

I always use a short name, and then set domain in /etc/resolv.conf and ensure that the FQDN and short name are in /etc/hosts so reverse lookups fetch it.”

And finally, another reply from Edward:

“Date:   Wed, 7 Jan 2015 14:58:54 +0000
From:   Davignon, Edward
Subject: Re: Hostname as short name or FQDN?

A related question is how should /etc/hosts and DNS be configured for reverse lookups (i.e. lookups by address)?

Should /etc/hosts have “ipaddr fqdn shortname” or “ipaddr shortname fqdn”? Likewise for DNS, should the reverse lookup return “fqdn” or “shortname” or alternate using round robin?

I pose these as questions, but they are really things to check when troubleshooting applications that rely on name resolution.

DNS can be queried directly using “dig” or “nslookup”.

I have seen numerous misconfigured /etc/hosts files that don’t match DNS for reverse lookups. I have also seen DNS servers return “shortname” instead of “fqdn”. I have seen DNS alternate “fqdn” or “shortname”. I have also seen DNS return wrong domainnames, too.

Usually the problem I see is /etc/hosts has “ip shortname” or “ip shortname fqdn”, but DNS reverse lookups return “fqdn”. This causes inconsistency between the local server and the remote (app) servers, usually resulting in inconsistency of access controls between app servers, or an app server and its database server. This can also happen when someone changes /etc/netsvc.conf from empty to “hosts=local,bind4”. I use “grep '^[^#]' /etc/netsvc.conf” to check it; it grabs non-blank lines that don’t start with a comment character.”

The discussion died out at this point, but it got me wondering what my readers typically do. I prefer to use a shortname for the host, and then make sure /etc/resolv.conf is set up correctly. Would any of you care to make an argument for having a FQDN in your environment?
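
For what it’s worth, here’s the kind of quick sanity check I run before weighing in. This is only a minimal sketch; the host name and address are made up, and the host command’s output format can vary by AIX level:

    hostname                        # short name, e.g. lpar1
    host lpar1                      # forward lookup
    host 10.1.1.10                  # reverse lookup; should match on every server involved
    grep '^[^#]' /etc/netsvc.conf   # confirm the lookup order (local vs. bind)
    grep lpar1 /etc/hosts           # make sure /etc/hosts agrees with what DNS returns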

Security Behind the Firewall

Edit: Still worth considering.

Originally posted January 27, 2015 on AIXchange

Although many of us like to assert that AIX running on Power hardware is a secure operating system, we must be aware of methods that might be used to try to compromise the systems we maintain. Just because the AIX user base is smaller than their Windows or Linux counterparts, we shouldn’t assume that AIX systems cannot be breached and aren’t being targeted. These systems typically run software for hospitals, banks, manufacturers, etc., industries where uptime and performance are critical and data privacy is essential.

With that in mind, this recently released document, entitled AIX for Penetration Testers, examines the delicate balance between providing user access and maintaining system security:

“AIX is a widely used operating system by banks, insurance companies, power stations and universities. The operating system handles various sensitive or critical information for these services. There is limited public information for penetration testers about AIX hacking, compared the other common operating systems like Windows or Linux. When testers get user level access in the system the privilege escalation is difficult if the administrators properly installed the security patches. Simple, detailed and effective steps of penetration testing will be presented by analyzing the latest fully patched AIX system. Only shell scripts and the default installed tools are necessary to perform this assessment. The paper proposes some basic methods to do comprehensive local security checks and how to exploit the vulnerabilities.

“The reconnaissance process is the most important task. If an auditor has enough information about the target system, applications and the administrator, it can lead to privilege escalation. After getting user level access on an AIX system, start by finding and exploiting operation issues caused by the administrator.”

Based on information in the document, here are some basic security questions to ask and answer (a quick sketch of the corresponding checks follows the list):

* sudo: is it properly configured?

* umask settings: have they been changed from defaults?

* exploitable SUID/SGID binaries: do they exist on the system?

* the PATH: has it been set up properly?
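
As a starting point, here’s a rough first pass at checking those items from a shell. Treat this as a sketch only; these are standard AIX commands, but the paper’s own methodology goes much deeper:

    find / -type f \( -perm -4000 -o -perm -2000 \) -ls 2>/dev/null   # list SUID/SGID binaries to review
    grep -i umask /etc/security/user                                  # check the default umask stanza
    sudo -l                                                           # see what sudo allows this user (if sudo is installed)
    echo $PATH | tr ':' '\n'                                          # look for . or world-writable directories in PATH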

“This methodology defines key local vulnerable points of AIX system. Auditors can make their own vulnerability detection scripts to decrease the time of the investigation based on this methodology. The suggested test steps are information gathering, exploit operation bugs, checking 3rd party software and finally the core system. Valuable information and great ideas are hidden in system guides, developer documentation and man pages. This methodology only describes quick and useable techniques. There are many other vulnerability assessment concepts worth the research, including syscall, signal or file format fuzzing.

“System administrators and auditors can apply useful hardening solutions from the vendor [IBM]. There is a secure implementation of the AIX system called Trusted AIX (IBM, 2014). The mentioned hardening features and guides can increase the local security level of the operating system. Hardening supplemented by professional penetration testing is the proper way to do security.”

Although many organizations like to think that being behind a firewall makes them secure, they forget that trusted users are behind many successful attacks.

What are you doing to protect your systems from unauthorized access and privilege escalation?

vtmenu and Life’s Little Annoyances

Edit: Still good to remember how to disconnect. And still worth asking, how many little annoyances do you just choose to live with?

Originally posted January 20, 2015 on AIXchange

Recently a friend asked me about vtmenu:

“You know when you run vtmenu and you exit using ~. and it disconnects your ssh session to the HMC. Do you remember the keystroke combination which will just return me to the vtmenu or HMC command line? Can’t find it anywhere.”

This has happened to me before: I enter ~. and instead of going back one level, it completely disconnects me from my ssh session to the HMC. Usually I’m absorbed with some task or problem — in the zone, you could say — and I just return to the HMC console, run vtmenu again, and reconnect to my partition without giving it a thought. I consider it another of life’s minor annoyances, like remembering to run set -o vi or stty erase ^? if your profile hasn’t been set up.

Of course to new users, those annoyances can really add up. But surely there is a solution.

Another friend offered this suggestion:

“Use ~~. instead; ~. is also the openssh exit. It doesn’t affect putty and the Windows ssh clients, but if you’re on linux… you quit ssh.”

I found similar advice here. And when I searched the AIXchange archives, I rediscovered this post.

“You can also use the mkvterm -m -p command if you know the machine name and the LPAR name. I find vtmenu to be useful if you do not know that information off the top of your head. If you need to get the machine name, try lssyscfg -r sys, then use lssyscfg -r lpar -m -F to get a list of LPAR names. If someone else is using a console, or you left a console running somewhere else, you can use the rmvterm -m -p command.

In any event, when you are done using a console, you can type ~~. in order to cleanly exit, and you will get a message that says Terminate session? [y/n]. Answer with y and you will go back to the vtmenu screen or to the command line, depending on what method you used to create the console.”
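
For reference, the basic flow looks roughly like this when run from the HMC command line (the managed system and LPAR names here are made up):

    lssyscfg -r sys -F name                                # list managed systems
    lssyscfg -r lpar -m Server-9117-MMB-SN12345 -F name    # list the LPARs on one of them
    mkvterm -m Server-9117-MMB-SN12345 -p lpar1            # open a console to an LPAR
    (type ~~. inside the console to cleanly end the session)
    rmvterm -m Server-9117-MMB-SN12345 -p lpar1            # close a console someone left open elsewhere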

Pleased as I was to find this solution, it didn’t work for my friend. And because I couldn’t reproduce the problem, I was unable to offer further help. So the question remains: How do you cleanly disconnect from inside vtmenu? Hopefully my readers have some suggestions.

And in general, how accepting are you of these types of annoyances? Do you shrug them off, or do you put some effort into solving these problems? What about those of you who work on others’ machines where fooling around with .profiles and the like might not be appreciated?

Why I Choose NIM to Install VIOS

Edit: This is still good stuff.

Originally posted January 13, 2015 on AIXchange

In May 2013 I wrote about installing the VIO server using the HMC GUI. This more recent article covers the same topic. At the end of Bart’s post he mentions using the virtual media repository, something I covered here.

While I have used the HMC GUI, I prefer to set up the NIM server and use it to load the VIO server. Bottom line, using NIM is faster than using the HMC GUI.

Of course, there are instances when NIM isn’t an option, such as IBM i environments that run VIO servers. Another example is a data center that’s new or still under construction. Some of my customers fall into this category, and as a result I frequently do system builds in data centers that don’t yet have a NIM server. Often in these situations I don’t even have a network available, because the network guys are simultaneously installing and configuring their gear. Until the network is up and running, the physical machine is all I have.

So what are the options at this point? I could run crossover cables and get the HMC to talk to a network adapter on the system, and then use the HMC to install the VIO server. If there’s physical media, installing from a DVD is an option, although with smaller systems that have split back planes it can be tricky to use the DVD to install to a second VIO server.

In the past I’ve loaded the VIO server to internal disk, created a virtual media repository and copied VIO and AIX DVDs over to that repository. Then I use those .iso images to build a NIM server that’s booting from a free local drive. Once the NIM server is built, I can use it to create my second VIO server via the internal network. At least this allows me to load LPARs across the internal virtual network while I wait for my physical network to be built out. If you’re on a strict timeline (and really, when are we not?), this method can help you be productive as you wait for the network to become available.
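
If you haven’t set up a virtual media repository before, the padmin commands involved look roughly like this. This is a sketch from memory, with made-up sizes and file names; check the VIOS command reference for the exact syntax at your level:

    mkrep -sp rootvg -size 30G                               # create the media repository in a storage pool
    mkvopt -name aix71.iso -file /home/padmin/aix71.iso -ro  # add an ISO image to the repository
    mkvdev -fbo -vadapter vhost0                             # create a file-backed optical device (vtopt0)
    loadopt -vtd vtopt0 -disk aix71.iso                      # load the ISO so the client LPAR can boot from it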

 I’ve also been in situations where the network was running, but VLAN tagging was in place. In such a scenario, I would go into SMS and set up VLAN tagging for my remote IPL to use for booting. However, there’s no option that I know of to define a VLAN within the HMC GUI (if that’s what you’re using to install VIO server). Sure, this can typically be handled by asking a network admin to temporarily change the VLAN configuration, but of course, some network guys are more amenable to such a request than others. It’s something to be aware of.

Here’s another advantage to using NIM rather than install from the HMC: I had a customer that wanted to set up a third test VIO server using the HMC GUI. They had a spare fibre card, but no spare network card. This wasn’t an issue since they could put the VIO server onto an existing internal VLAN and communicate externally via the existing shared Ethernet adapters on their other two VIO servers. The problem was the GUI only recognizes physical adapters, not virtual ones. Using NIM, we were able to get it to work.

What’s your preferred way to install new systems?

Getting Detailed VIO Server Info

Edit: Still good stuff.

Originally posted January 6, 2015 on AIXchange

I wanted to know the VIO server version I was running. Simple, right? I ran $ioslevel, and learned I was on Version 2.2.3.4.

That’s nice, but how do you find out if you’re running fixpacks or service packs? How can you get more information about the version you’re on?

 The help command was very helpful:

            $help ioslevel

            Usage: ioslevel

                   [Reports the latest installed maintenance level of the system.]

In case you missed it, I was being sarcastic.

Then I remembered that there should be a file that contains this sort of information. Unfortunately, I couldn’t recall where it was located. Was it in /etc, along the lines of /etc/redhat-release on a Redhat server? Nope.

So I took a look at the command that was being run under the covers when I ran ioslevel. First I enabled CLI debugging with:

          export CLI_DEBUG=33

Then I tried running ioslevel again:

          $ioslevel

          AIX: “cat /usr/ios/cli/ios.level “

          2.2.3.4

I already had this output, but the debugging did help me locate the file I was seeking. I went into /usr/ios/cli and found these files:

            .license

            .profile

            .profile.ce

            FPLEVEL.txt

            README.txt

            README.vios

            SPLEVEL.txt

            cron_mail_check.sh

            environment

            ios.level

            ioscli

            itm

            langlist

            lsvirt.snap

            man.ksh

SPLEVEL.txt, FPLEVEL.txt and ios.level were the files I needed. Checking them, I discovered I was running:

            $ cat /usr/ios/cli/ios.level

            2.2.3.4

            $ cat /usr/ios/cli/SPLEVEL.txt

            SP-01

            $cat /usr/ios/cli/FPLEVEL.txt

            FIXPACK:FP-25
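
If you want all three values in one shot, a quick loop does it. This is a trivial sketch; I would expect it to work from the padmin shell, but if the restricted shell objects to command substitution, run it from a root shell via oem_setup_env:

    for f in ios.level SPLEVEL.txt FPLEVEL.txt
    do
        echo "$f: $(cat /usr/ios/cli/$f)"
    done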

That solved my problem, but I still have questions. Am I mistaken, or at some point did ioslevel provide all this information automatically? Has something changed recently?

Chef Client and Other Nuggets from Twitter

Edit: Some links no longer work.

Originally posted December 23, 2014 on AIXchange

Recently on his Twitter feed, IBMer Jay Kruemcke noted that Chef Client is now available on AIX. I’ll write about this in detail in the near future, but for now, here’s the word from Chef’s website:

“Today I’m very pleased to announce the availability of Chef Client 12.0 for IBM AIX 6.1 and 7.1. It is freely available from our downloads page and can be used with any version of the Chef Server up to and including Chef Server 12.

This is the first major new platform for Chef in some time, and it’s certainly been a long time in coming. We’ve heard from some of our large enterprise customers that they have significant investment in IBM’s AIX platform and expect to continue that into the future, so they would like to manage those systems using the same flexible automation tool, Chef, with which they are already familiar.”

I’ve long recommended Twitter as a resource for finding news and information about AIX and related topics. I should point out that you don’t need to have your own Twitter feed to benefit from it. For example, you can find my own feed simply by searching on “Rob McNelly twitter.” (For the record, here’s the direct link.) Twitter’s advanced search function is another, more direct way to dive in. Check out the results for a search on “IBM Power Systems.”

As a regular user of Twitter, I follow a number of AIX experts, including Kruemcke. Other recent tweets of Jay’s have led me to a document that summarizes update benefits for AIX releases and related offerings, as well as videos on using nmon interactively and capturing data to nmon files so they can be analyzed later.

Jay also tweeted about this white paper covering HMC simplification:

“Managing the IBM PowerVM infrastructure involves configuring its different components, such as the POWER HypervisorTM and the Virtual I/O Server(s). Historically, this has required the use of multiple management tools and interfaces, such as the Hardware Management Console (HMC) and the Virtual I/O Server command line interface.

The PowerVM simplification enhancements were designed to significantly simplify the management of the PowerVM infrastructure, improve the Power SystemsTM management user experience, and reduce the learning ramp for users unfamiliar with the PowerVM technologies.

This paper provides an overview of the PowerVM simplification enhancements and illustrates how to use the new features available in the HMC to set up and manage the PowerVM infrastructure.”

While I’m at it, here are some other nuggets I’ve found on Twitter lately. Via IBM’s Nigel Griffiths, learn how to use nmon analyser in this Steve Atkins video. And from IBM’s Chris Gibson, here’s information about an STG lab services offering and the LPM automation tool. And thanks to Christophe Rousseau and chmod666.org, I even came across this humorous visual about live migrations.

Do you use Twitter to locate AIX resources? Who do you follow?

 This blog will be updated on January 6, 2015. Happy New Year!

Is This the End for Desktops and Laptops?

Edit: I am still running my laptop and desktop.

Originally posted December 16, 2014 on AIXchange

Lately I’ve read several commentators predict that desktops and laptops will soon be completely replaced by tablets and phones. I sure hope they’re wrong.

Although there’s much to be said for the portability of a phone or a tablet, I still find that there’s much value to be derived from a laptop. Consider this scenario:

“In the classroom, I took my brand new iPhone 6, plugged it into the lecture theatre’s HDMI port, and ran the whole presentation — in high definition, complete with nicely animated transitions — off my phone.”

Really? Surely the author didn’t use his iPhone 6 to create that presentation? What device would you use to look up the material, edit it together and prepare it to be copied over to the iPhone 6?

One thing I’d immediately want in that scenario is multiple monitors, the bigger the better. One screen to search for the clips, and another open with my editing software to create the presentation.

“My friend runs the IT infrastructure for one of Australia’s most successful online retailers. It’s his job to make sure the customer-facing systems ringing up sales are available 24×7. Always on call, getting texts advising him of the status of his servers, services, and staff, he keeps a laptop close at hand, in case something ever needs his personal attention. Something always does.

“‘Got a little Bluetooth keyboard to go along with it,’ he continued. ‘When I’m in the office I’ll AirPlay it over to an Apple TV connected to a monitor. What’s the difference between that and a desktop?'”

To me, one difference is the tools I’m using to do the job. I still have a pretty decent keyboard and pointing stick, which I feel is a superior “mouse” compared to a track pad. With the new rig they’re proposing, I would have to make sure to charge my Bluetooth keyboard and find some free monitor I can use somewhere. I’d prefer to just carry the whole thing with me.

“The desktop has been dead for some years, resurrected to an afterlife of video editing and CAD. Laptops keep getting smaller and more powerful, but we’ve now reached a moment when they’re less useful than our smartphones.

“The laptop market will not collapse overnight. There’s a lot of inertia in IT — people like what they like and tend to use what they know — but the current cycle of PC replacements is likely to be the last one.

“The computer as we have known it, with integrated keyboard and display, has lost its purpose in a world of tiny, powerful devices that can cast to any nearby screen (Chromecast & AirPlay), browse any website, and run all the important apps. Why carry a boat anchor when you can be light as a feather?”

Maybe this is part of my problem. I like my boat anchors. I like knowing that I have an optical device I can use to burn DVDs in a pinch. I like knowing I can pull out that DVD burner and replace it with a second hard drive. I like knowing I can max out the memory and run some flavor of virtualization software (like VMware) and run multiple operating systems at once.

No doubt, many things can be done with a phone, a Bluetooth keyboard and a borrowed monitor. I’m sure for some if not most users, it’s all they need:

“We still rely on devices with processors and memory, they are just different devices. The mobility trend has been clear for years with notebooks today demanding larger market share than desktops. And one thing significant about notebooks is they required of us our first compromise in terms of screen size. I write today mainly on a 13-inch notebook which replaced a 21-inch desktop, yet I don’t miss the desktop. I don’t miss it because the total value proposition is so much better with the notebook.

“What’s still missing are clearcut options for better I/O — better keyboards and screens or their alternatives — but I think those are very close. I suspect we’ll shortly have new wireless docking options, for example. For $150 today you can buy a big LCD display, keyboard, and mouse if you know where to shop. Add wireless docking equivalent to the hands-free Bluetooth device in your car and you are there.”

If you believe some people, the end of the desktop and laptop is already here:

“This year we’ll see an important structural change take place in the PC hardware market. I’m not saying there won’t still be desktop and notebook PCs to buy, but far fewer of us will be buying them….

“The iPhone in your pocket will become your desktop whenever you are within range of your desktop display, keyboard and mouse. These standalone devices [were] Apple’s big sellers in 2014 and [will be] big sellers for HP and Dell in 2015 and beyond. The next iPod/iPhone/iPad will be a family of beautiful AirPlay displays that will serve us just fine for at least five years linked to an ever-changing population of iPhones.”

Have you ditched your laptop because you can do everything you need to do from your phone? Are you close to doing so? I can’t be the only dinosaur out here.

Creating Additional VLANs With the HMC CLI

Edit: Some links no longer work. I still love the CLI.

Originally posted December 9, 2014 on AIXchange

When I create LPARs, I prefer the HMC command line interface (CLI) to the GUI. The CLI is especially advantageous when I’m using a VPN to a remote site that’s slower than I’d prefer. Pointing and clicking and going through wizards is inconvenient — and it becomes exponentially so when you’re talking about setting up hundreds of LPARs at a time in multiple data centers.

I’ve previously discussed the HMC CLI, and of course I’m not the only one to have written about the various options (here, here and here). Still, I couldn’t find any good examples of a scenario where you’re creating virtual Ethernet adapters with multiple additional VLANs.

The typical GUI interface looks like this: 

In this simple example I’m looking to have additional VLANs on this virtual interface. This information on chsyscfg was helpful:

                virtual_eth_adapters
                Comma separated list of virtual ethernet adapters,
                with each adapter having the following format:

                virtual-slot-number/is-IEEE/port-vlan-ID/
                [additional-vlan-IDs]/[trunk-priority]/
                is-required[/[virtual-switch][/[MAC-address]/
                [allowed-OS-MAC-addresses]/[QoS-priority]]]

In particular, this caught my eye:

                If values are specified for additional-vlan-IDs, they
                must be comma separated.

So here you have a command that is expecting a comma separated list of virtual Ethernet adapters, and inside of that you have a comma separated list of additional VLANs.

However, things didn’t go as I hoped. I tried many different combinations of “ and ‘ and \” but couldn’t get it to work. I finally opened a PMR, so I’ll skip to the conclusion so that you can hopefully avoid my pain. All I was doing was adding an adapter to an existing profile, and that’s what ultimately worked. Obviously the serial number and the actual VLAN names are changed, but this command should correspond to the example from the GUI above.

chsyscfg -m SN12345 -r prof -i name=default,lpar_id=2,\”virtual_eth_adapters=\”\”4/1/1/1,2,3,4,5/2/1\”\”\” 
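
To map that back to the format from the chsyscfg documentation quoted above, here’s my reading of the 4/1/1/1,2,3,4,5/2/1 portion (the values themselves are just the made-up ones from this example):

    4           virtual slot number
    1           is-IEEE (802.1Q aware)
    1           port VLAN ID
    1,2,3,4,5   additional VLAN IDs
    2           trunk priority
    1           is-required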

As you likely expect, doing more than one makes it even more convoluted:

chsyscfg -m SN12345 -r prof -i name=default,lpar_id=2,\”virtual_eth_adapters=\”\”2/1/1/1,2,3,4,5/2/1\”\”\,\”\”3/1/2/6,7,8,9,10/2/1\”\”,4/0/10//0/1\”

Just imagine if you had 15 or 20 VLANs per virtual adapter that you needed to deploy throughout your environment.

Maybe this syntax is obvious to you, but it wasn’t obvious to me, so I’m putting this information out there for the next person who needs to determine how to use a comma separated list inside of another comma separated list with the HMC CLI.

Dummy Devices

Edit: Do you still manage disks manually?

Originally posted December 2, 2014 on AIXchange

A customer was looking to assign SAN LUNs to a pair of VIO servers using vSCSI. When the VIO servers were configured, VIO1 was assigned adapters containing odd numbers of internal disks:

            VIO1

            lsdev -Cc disk

            hdisk0 Available 04-00-00 SAS RAID 0 SSD Array

            hdisk1 Available 04-00-00 SAS RAID 0 SSD Array

            hdisk2 Available 04-00-00 SAS RAID 0 SSD Array

            hdisk3 Available 0C-00-00 SAS Disk Drive

            hdisk4 Available 0H-00-00 SAS Disk Drive

            VIO2

            lsdev -Cc disk

            hdisk0 Available 05-00-00 SAS Disk Drive

            hdisk1 Available 06-00-00 SAS Disk Drive

Because of this, the shared LUNs would be out of sync when they were allocated to the VIO servers. On VIO1, the next hdisk number would be hdisk5, while on VIO2, the next hdisk number would be hdisk2. The customer wanted to keep the hdisk numbers consistent on the two servers to simplify the process of mapping the LUNs to client LPARs. Consistent numbering would also make it a bit easier to conduct any future troubleshooting.

There was no shortage of ways to resolve this issue. The SAN team could assign some small LUNs to VIO1 only. That way both servers would have the same number of hdisks once mapping of the “real” shared LUNs began.

I found more options online (see here, here and here). Unfortunately, we couldn’t get any of them to work.

Finally, I was shown this blog post on dummy devices, which includes the following command:

mkdev -l hdisk100 -c disk -s sas -t scsd -p sas0 -w sas -d

Sure enough, this did the job:

hdisk100 Defined

lsdev -Cc disk

hdisk100 Defined   00-00-00-00 MPIO Other SAS Disk Drive

Basically, this method allows you, the admin, to add in all the dummy hdisks you need without involving SAN personnel.
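
The workflow, as a quick sketch using the same example device (untested beyond what’s shown above), looks like this:

    mkdev -l hdisk100 -c disk -s sas -t scsd -p sas0 -w sas -d   # create the placeholder in the Defined state
    (map the real shared LUNs on both VIO servers so the numbering stays in sync)
    rmdev -dl hdisk100                                           # remove the placeholder once it’s no longer needed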

When you’re using vSCSI, do you take the time to keep your hdisks in sync? In what other instances do you find it necessary to create a dummy device? How do you go about it?

The Value of Hardware Maintenance

Edit: Keep your maintenance current.

Originally posted November 25, 2014 on AIXchange

Recently, a customer was looking for help with their machine, a 7038-6M2 running AIX 5.1.

The customer attempted to call IBM for assistance. They didn’t get very far at first. This machine was announced in 2002 and withdrawn from marketing in 2005. In addition, IBM no longer supports AIX 5.1 or any previous operating system.

As I’ve often said, lots of businesses continue to run on older hardware and legacy OSs. While this speaks to the high quality of IBM systems and software, it’s still a risky venture, because even the most well-made systems will break down eventually.

In this case, the test/dev LPAR wouldn’t boot up once a D20 drawer was added to the system. Luckily, the production LPAR wasn’t impacted.

A bad boot drive was initially thought to be the culprit. Some used drives were procured and one of the disks was replaced. The intent was to boot from the remaining good disk and then mirror to the used drive. An AIX 5.1 CD was used as boot media. While the machine came up, the used disk wasn’t recognized. Was this a firmware issue? Was the wrong part number being used as a replacement?

A different used drive was tried, but it wasn’t recognized either. No one was sure what to try next. Finally, the question arose: What kind of a PMR did you try to open with IBM, software or hardware? Ah ha. The customer had tried to open a software PMR but wasn’t entitled. However, the machine was still covered under IBM hardware support.

Once a ticket was opened, the disk carrier was determined to be the problem. When that part was replaced, the machine was able to boot. One of the disks actually was bad though, so one of the replacement disks was used to create a mirror of rootvg. The system is running fine now.

The moral: If you’re still running old hardware, keep IBM maintenance on it. Or better yet, seriously consider upgrading to something newer.

On an unrelated note, I saw the following from IBM that might impact customers who order physical copies of media:

Under Software Updates: Effective November 18, 2014 customers in the USA who select the physical delivery option will be invoiced 350USD + sales tax for the order. Note that download delivery remains free of charge.

Be sure to give yourself any additional time necessary to download and burn any physical media you might need. Otherwise be prepared to pay this new fee if you still want IBM to ship it to you.

Tracking Network Devices

Edit: Still good stuff.

Originally posted November 18, 2014 on AIXchange

Which switch port is your network port plugged into?

Oftentimes this simple bit of information goes undocumented. Perhaps everything is being plugged in at a remote site by some “hands and eyes” guys and you’re just not sure if the cabling has been completed or if it’s correct according to the documentation you received. Or maybe you just want more information about the network device that you’re plugging into.

I was reminded of an interesting method for obtaining this information. Before I get into it, keep in mind that this might not work depending on the switch you’re connecting to or its security settings. That said, I’ve had pretty good luck with it so far.

To get this working in my environment, I first needed to see what physical cards I’d connected to the switch. I accomplished this with the lscfg command, which displayed the cards available in my system:

lscfg | grep en

In my test machine I have these ports:

To determine which ports are reporting that they’re up, I ran:

            for i in 0 1 2 3 4 5 6 7

            do

            echo ent$i ; netstat -v ent$i | grep Status

            done

I received this output:

            ent0

            No network device driver information is available.

            ent1

            Physical Port Link Status: Down

            Logical Port Link Status: Down

            DCBX Status: Enabled

            MAC ACL Status: Disabled

            VLAN ACL Status: Disabled

            ent2

            Physical Port Link Status: Up

            Logical Port Link Status: Up

            DCBX Status: Disabled

            MAC ACL Status: Disabled

            VLAN ACL Status: Disabled

            ent3

            Physical Port Link Status: Up

            Logical Port Link Status: Up

            DCBX Status: Disabled

            MAC ACL Status: Disabled

            VLAN ACL Status: Disabled

            ent4

            No network device driver information is available.

            ent5

            Link Status: Down

            Transmit and Receive Flow Control Status: Disabled

            ent6

            Link Status: Down

            Transmit and Receive Flow Control Status: Disabled

            ent7

            Link Status: Down

            Transmit and Receive Flow Control Status: Disabled

This showed me the status of every port. In my case, I know that the ports that report “No network device driver information is available” are part of my Shared Ethernet adapters, so that gives me an idea of which adapters are being used by SEA on this VIO server.
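
As an aside, if you don’t want to hard-code the adapter numbers in that loop, a variation like this should also work (a sketch that assumes the ent devices show up in lsdev output as expected):

    for e in $(lsdev -Cc adapter -F name | grep '^ent')
    do
        echo $e ; netstat -v $e | grep Status
    done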

The method that I will now describe will work on network ports that don’t have an SEA on them. Maybe you have your own method to use once your port is already up and active in an SEA? If so, let me know in comments.

In above output, ent2 and ent3 are reporting that they’re up. I put a dummy IP address on them:

ifconfig en2 10.9.0.1 netmask 255.255.255.0

ifconfig en2 up

Then I ran tcpdump:

tcpdump -nn -v -i en2 -s 1500 -c 1 'ether[20:2] == 0x2000'

After a short wait, I received this output:

            tcpdump: listening on en2, link-type 1, capture size 1500 bytes

            08:09:13.046930 CDP v2, ttl: 180s, checksum: 692 (unverified)

                        Device-ID (0x01), length: 22 bytes: ‘ucs6120-A(SSI140206FM)’

                        Address (0x02), length: 13 bytes: IPv4 (1) 10.33.0.31

                         Port-ID (0x03), length: 12 bytes: ‘Ethernet1/20’

                         Capability (0x04), length: 4 bytes: (0x00000228): L2 Switch, IGMP snooping

                         Version String (0x05), length: 70 bytes:

                           Cisco Nexus Operating System (NX-OS) Software, Version 5.2(3)N2(2.22c)

                         Platform (0x06), length: 9 bytes: ‘N10-S6100’

                         Native VLAN ID (0x0a), length: 2 bytes: 705

                         AVVID trust bitmap (0x12), length: 1 byte: 0x00

                         AVVID untrusted ports CoS (0x13), length: 1 byte: 0x00

                         Duplex (0x0b), length: 1 byte: full

                         MTU (0x11), length: 4 bytes: 1500 bytes

                         System Name (0x14), length: 9 bytes: ‘ucs6120-A’

                         System Object ID (not decoded) (0x15), length: 14 bytes:

                          0x0000:  060c 2b06 0104 0109 0c03 0103 864f

                         Management Addresses (0x16), length: 13 bytes: IPv4 (1) 10.33.0.31

                         Physical Location (0x17), length: 14 bytes: 0x00/Lab Switch 1

            4 packets received by filter

            0 packets dropped by kernel

 I can do the same thing with en3.

From this I know what kind of switch I’m connected to, which port it’s connected to, what OS the switch is running, the VLAN I’m on, the MTU size, the management address of the switch, etc.

Be sure to read the link above for additional information about ether channel and other details.
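
One more variation: if your switch speaks LLDP rather than CDP (common outside of Cisco shops), the same dummy-IP trick should work with a filter on the LLDP ethertype instead. I haven’t tested this on every adapter type, so treat it as a sketch:

    tcpdump -nn -v -i en2 -s 1500 -c 1 'ether proto 0x88cc'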

What other methods do you use to determine which physical ports your machines are using?

Firefox SSL Fix for HMC Users

Edit: Link no longer works.

Originally posted November 7, 2014 on AIXchange

I have pretty good luck when using Mozilla with my HMCs. However, when I recently upgraded Mozilla, I encountered an issue:

An error occurred during a connection to hmc1. Issuer certificate is invalid. (Error code: sec_error_ca_cert_invalid)

    The page you are trying to view cannot be shown because the authenticity of the received data could not be verified.

    Please contact the website owners to inform them of this problem. Alternatively, use the command found in the help menu to report this

I found a solution in this technote. Although it’s referring to Domino servers, the concept is still the same.

After updating Firefox to version 31 (or later), when Firefox browser users attempt to access a MD5-based SSL certificate, generated by a Domino Web server, the connection attempt will fail with the following error: Secure Connection Failed. An error occurred during a connection to <server name>. Issuer certificate is invalid. (Error code: sec_error_ca_cert_invalid)

Firefox 31 introduces a new certificate verification library, controlled by the security.use_mozillapkix_verification preference, for stricter enforcement of SSL certificate verification (see this MozillaWiki article for details).

After updating Firefox to version 31 (or later), when Firefox browser users attempt to access a MD5-based SSL certificate, generated by a Domino Web server, the connection attempt will fail with the error shown below. This includes Domino self-signed testing certificates generated from the Server Certificate Admin database or server SSL certificates generated from the Domino Certificate Authority.

You can perform the following steps on local Firefox browsers to restore the older SSL libraries for Firefox, which will allow HTTPS connections to your server.

Step 1. Type about:config in the Firefox address bar to access Advanced settings. Read the warning presented, and then click the “I’ll be careful, I promise” prompt to accept and proceed.

Step 2. Scroll down to security.use_mozillapkix_verification and double-click to toggle its value from true to false (or, right-click on it and select Toggle).

Once I did this, I was able to connect to my HMC as usual. Hopefully this tip will help should you run into this same issue in the future.

A Useful Data-Compression Option

Edit: Have you run this tool?

Originally posted November 3, 2014 on AIXchange

Perhaps you’re interested in compressing data on your IBM storage devices. But do you have any idea how much of your data is actually compressible?

The Comprestimator utility is designed to tell you how much actual compression you’ll achieve without actually compressing your data. Download version 1.5.1.1 here. From the same link, here’s a detailed description of the tool: 

“Comprestimator is a command line host-based utility that can be used to estimate an expected compression rate for block devices.

The Comprestimator utility uses advanced mathematical and statistical algorithms to perform the sampling and analysis process in a very short and efficient way. The utility also displays its accuracy level by showing the maximum error range of the results achieved based on the formulas it uses. The utility runs on a host that has access to the devices that will be analyzed, and performs only read operations so it has no effect whatsoever on the data stored on the device. The following section provides useful information on installing Comprestimator on a host and using it to analyze devices on that host. Depending on the environment configuration, in many cases Comprestimator will be used on more than one host, in order to analyze additional data types.

In order to reduce the impact of block device and file system behavior mentioned above it is highly recommended to use Comprestimator to analyze volumes that contain as much active data as possible rather than volumes that are mostly empty of data. This increases accuracy level and reduces the risk of analyzing old data that is already deleted but may still have traces on the device.

Comprestimator version 1.5 adds support for analyzing expected compression savings in accordance with Storwize V7000, SAN Volume Controller (SVC) and FlashSystem V840 storage systems running software version 7.3. Among other enhancements in the software, version 7.3 adds support for the 2014 hardware models Storwize V7000 Gen2, SVC DH8 and FlashSystem V840 AC1.

Comprestimator is supported and can be used on the following client operating system versions:

  • Windows 2003 Server, Windows 7, Windows 2008 Server, Windows 8, Windows 2012
  • ESXi 4, 5
  • AIX 6.1, 7
  • Red Hat Enterprise Linux Version 5.x, 6.x
  • HP-UX 11.31
  • Sun Solaris 10, 11
  • SUSE SLES 11
  • Ubuntu 12
  • CentOS 5.x

Comprestimator is designed to scan any block device that is readable by the OS itself. This typically includes devices managed by logical volume managers (LVMs) or partitioned by the OS. However, for practical reasons, since compression is applied to physical volumes, it is recommended to estimate compression by running Comprestimator on the same block device/physical volume that will be compressed, and not on a logical volume, which may be spanning on those volumes. It is thereby highly recommended to always analyze the native block-device when using Comprestimator.

Some volume managers “reserve” some of the LUN capacity for internal use. Since Comprestimator reads directly from the block device, some of these IOs may fail. The tool will tolerate up to 1% failed IOs and a scan will be aborted if this threshold is reached.”

Rather than guess what you might save on disk space when you turn on compression, try this tool and learn some real-world numbers based on your actual environment.

Reformatting IBM i pdisks to AIX hdisks

Edit: It has been a while since I have needed to do this.

Originally posted October 27, 2014 on AIXchange

Recently I needed to reformat an IBM i LPAR as an AIX LPAR for some testing. After defining the partition and reusing the IBM i hardware, I tried to boot it from physical install media (as there is no VIOS on this machine).

The OS would boot, but wouldn’t recognize any of the disks. If I went into maintenance mode and started a shell, lscfg displayed pdisks, but not hdisks.

This made sense, as these disks were still set up as an IBM i raid array. I needed to format them so that AIX could use them. The AIX boot media didn’t have the capability to format the disks; fortunately, IBM developerWorks had what I needed:

The IBM Standalone Diagnostics CD-ROM provides hardware diagnostics and service-related utilities for POWER, PowerPC, eServer i5 system with common pSeries I/O, and RS/6000-based systems. The standalone diagnostics CD-ROM would be used in the following situations when it makes sense to test the hardware independent of the operating system:

* When there is no operating system installed on a system or partition

* When the operating system does not have support for the service related function you wish to perform

* When there may be a problem with the boot device

* When the service documentation specifically recommends running standalone diagnostics

 The actual IBM Standalone Diagnostic CD-ROM can be downloaded here.

Diagnostics, which are available for AIX and Linux systems and logical partitions, can help you perform hardware analysis. If a problem is found, you will receive a service request number (SRN) or a service reference code (SRC) that can help pinpoint the problem and determine a corrective action. Additionally, there are various service aids in the diagnostics that can help you with service tasks on the system or logical partition.

You can run the IBM Standalone Diagnostic the following ways:

 * Running the eServer stand-alone diagnostics from CD-ROM
 
 * Running the eServer stand-alone diagnostics from a Network Installation Management server

I downloaded the standalone diagnostic CD and was able to burn the .iso image to physical media and boot from it. From there I went in and changed the pdisks to hdisks, formatting them as JBOD disks. Then I swapped CDs and put the install media back in the drive. AIX was able to recognize the disks, the OS was installed, and the test LPAR was easily built.

A System Outage, and the Failures that Led to It

Edit: Some links no longer work.

Originally posted October 21, 2014 on AIXchange

Old Power servers just run. Most of us know of machines that sat in a corner and did their thing for many years. However, as impressive as Power hardware is, running an old, unsupported production server with an old, unsupported operating system isn’t advisable. This is one such story of a customer and its old, dusty machine that sat in a back room.

This customer had no maintenance; they simply hoped their box would continue to hum along. To me, that’s like taking your car to a shop and telling the mechanic: “The check engine light has been on for years, and I’ve never changed the oil or checked the tires. Why are you charging me so much to fix this?”

I imagine some of you are thinking about the fact that older applications can be kept in place by running versioned AIX 5.2 or AIX 5.3 WPARs on AIX 7. That option wasn’t selected in this case, however. This was a server running AIX 5.2 and a pair of old SCSI internal disks that were mirrored together in rootvg. Eventually, one of those disks began to fail.

When did it begin to fail? No one knew, because no one monitored the error logs. When the machine finally had enough, it crashed. Reboots would stop at LED 0518. In isolation, that’s no big deal. Just boot the machine into maintenance mode and run fsck.

In this case though, going into maintenance mode only resulted in more unanswerable questions. Where’s the install media? No one knew. No one knew where the most recent mksysb was, either. Ditto for the whereabouts of the keyboard for the console. No one knew. Time to start sweating.

Because this was a standalone server, there was no NIM server. Because it was a production machine, the outage affected several locations. Booting an older version of AIX and then trying to recover a newer version on rootvg is often problematic, and this instance was no exception. Though the customer could get AIX 5.2 media shipped to them from another location, they’d have to wait a day, and there was no guarantee that this version would be at the same level as the operating system they were using.

It turns out this customer was very, very fortunate, because someone, somehow located a 4-year-old mksysb tape. The machine booted from the tape drive and the customer was able to get it into maintenance mode, access a root volume group and run fsck on the rootvg filesystems. Some errors were corrected and the machine was able to boot. From there it was a relatively simple case of unmirroring the bad disk and replacing it with a new disk.

While naturally I’m happy that this customer resolved their issue, I present this story as a cautionary tale. Think of all the things that went neglected prior to the disk failure. Filesystems and errpt weren’t monitored. While nightly data backups were being taken, there were no recent mksysb backups. It’s possible that the last mksysb was taken at the time of system installation. There were no OS disks on hand. Only luck kept this customer from experiencing substantial downtime and losing significant business.
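
None of this requires fancy tooling, either. As a rough sketch of the basics (the backup path here is made up; adjust for your environment):

    errpt | more                            # review the error log for failing hardware
    lsvg -l rootvg                          # confirm rootvg logical volumes are mirrored and not stale
    mksysb -i /backup/$(hostname).mksysb    # take a current, bootable rootvg backup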

Now consider your environment. Do you occasionally take the time to restore your critical systems on a test basis, just to prove that you could restore them in an actual emergency? If you couldn’t boot a critical system, could you recover it? How long would it take?

More on the New HMC Release

Edit: I still enjoy the new interfaces. Some links no longer work.

Originally posted October 14, 2014 on AIXchange

After writing about the HMC’s new look, I found that one of my options wasn’t working as expected. When I clicked on Manage PowerVM, I got this error.

I opened a ticket with IBM Support, which sent me to this link to run this procedure:

The apply of V8R8.1.0 Service Pack 1 may cause some VIOS related tasks to fail. Impacted HMC tasks include Manage PowerVM and Manage partitions task in the new “enhanced GUI” as well as the Performance and Capacity Monitor (PCM). External applications using the HMC REST API such as IBM PowerVC are also impacted. The error text will typically include the error message “3003c 2610-366 The action array contains an undefined action name at index 0: VioService.

Contact IBM support for the circumvention until a fix is available.

Symptom: The apply of V8R8.1.0 Service Pack 1 may cause some VIOS RMC related tasks to fail. Impacted HMC tasks include Manage PowerVM and Manage partitions task in the new “enhanced GUI” as well as the Performance and Capacity Monitor (PCM). External applications using the HMC REST API such as IBM PowerVC are also impacted.

For example, the Manage PowerVM task fails with the error:

Exception occurred in Querying for Media Repositories from vios P8TVIO1 with ID 1 in CEC 8286-41A*TU20305 – Network interruption occurs while RMC is waiting for the command execution on the partition to finish. The operation might have caused CPU starvation or network disruption. The operation could have completed successfully. (3003c 2610-366 The action array contains an undefined action name at index 0: VioService. )

This procedure requires you to get a root shell on the HMC, which usually requires the assistance of IBM Support.

However, right around the time of this error, I saw a post on IBM developerWorks that notes another way to become root using the ‘shellshock’ bash security bug. (That post has since been removed, though HMC fixes to fix bash are available from IBM’s Fix Central.)

At the end of the instructions, you’re told to wait for the RMC connections to restart.  In my case, nothing happened.  I rebooted the HMC, and found this new error:

Support pointed me to this procedure, but then recommended I apply the fix:

“Querying VIOS using the HMC REST API fails with “Exception occurred in Querying for Media Repositories from vios <vios name> with ID <vios lpar id> in CEC <server MTMS> – Unable to connect to Database”. This error usually indicates that the VIOS database is corrupt and needs rebuilt.”

“At VIOS level 2.2.3.1, the resolution is to apply fix VIOS_2.2.3.1-IV52899m1a. No further recovery is needed.”

Seeing as I was at 2.2.3.3, I thought it was an odd suggestion, as 2.2.3.3 already had the fix. However, in the interest of completeness, I wanted to mention the procedure itself. Perhaps it will help with any issues you might encounter.

I decided to update the firmware on the server and reboot all of the LPARs. The Manage PowerVM option started working. Here are some screen shots.

Right-clicking on the active VIO server brought me to a view that looked pretty similar to what I was used to:

When I went to adapter view, it appeared something was missing:

I’ve been able to use the GUI to set up server templates and shared Ethernet adapters — all without needing to login as padmin to the VIO servers. Keep in mind that the “classic” mode still works exactly as you’re used to. The same is true for all of the VIO commands you’re used to. 

As I continue to learn about this HMC code, I’ll pass along more information.

POWER8 E870 and E880 Offer Impressive Performance

Edit: Link no longer works. This announcement feels like it happened yesterday.

Originally posted October 3, 2014 on AIXchange

New POWER8 server models were announced today: the scale-up E870 (9119-MME) and E880 (9119-MHE), along with an Ubuntu Linux-only model called the S824L. The E (with the “E” denoting “enterprise”) models will have I/O drawers available. An IBM Statement of Direction (SoD) indicates that I/O drawers will be available for the S models in 2015. The E870 and E880 will be generally available Nov. 18. This blog post provides details on the E models.

These new systems are a blend of the 795 and 780/770s. Architecturally, these new machines are similar to the Power 795, but the packaging in a 19-inch rack with multiple CECs is similar to the Power 780/770. The preliminary CPW and rPerf numbers that I saw during training (numbers that were still being tested and confirmed) were substantial and impressive. I’m sure we will see more information around these figures, which I did not have available at the time of writing.

The E870 is available as a one- or two-node system, and the E880 will eventually be available as a one-, two-, three- or four-node system, although at GA it will only be available as a one- or two-node system. The third and fourth node configurations are planned to GA in June 2015.

Each node in the E870 and E880 will have eight PCIe Gen3 x16 slots for low profile PCIe adapters; optionally these slots will be used as optical interfaces to the I/O drawers. The nodes are 5U in size and come with different core densities and speeds. The E880 will have a 32-core 4.35 GHz option with a SoD for a 48-core node. The E870 will have a 40-core 4.19 GHz option or a 32-core 4.02 GHz option. All nodes in a server must have identical processors; you cannot mix and match nodes. This means that the maximum for the E880 32-core node will be 64 cores at GA, with 128 cores in 2015. An SoD indicates IBM plans for 192 cores in 2015 using 48-core nodes. The E870 will have a maximum of 80 cores with the 40-core node, and 64 cores with the 32-core node.

There are 32 memory slots per node. These systems are using custom DIMMs that are running at 1600 MHz DDR3, with the E870 going up to 2 TB per node (with an SoD taking them to 4 TB per node in 2015) and the E880 going up to 4 TB per node when you use the largest DIMM sizes that are currently available.

There are no integrated SAS bays or SAS controllers in the node. There is no integrated DVD bay or DVD controller in the node. There is no integrated Ethernet in the node and no tape bay in the node. The node is strictly for power supplies, CPUs, memory and PCI slots.

A new concept is the system control unit, which is a 2U drawer that connects to the server at the midplane. It must be immediately physically adjacent to the system nodes. It holds the service processors, the HMC ports, the master system clocks, the operator panel, the VPD and an optional DVD. The system control unit also contains the redundant power, hot plug clock and battery. The idea is that all of the important components in the system control unit are redundant and these components are not a single point of failure for the machine.

A one-node E870 or E880 will take up 7U in a 19-inch rack, two nodes will take up 12U, and eventually when we get to three nodes it will take up 17U, and four nodes will take up 22U. IBM recommends that we leave 1U open at the top and/or bottom of the rack for easier cable management. IBM also recommends that we mount 1U power distribution units (PDUs) horizontally instead of in the side pockets to make cabling easier instead of the PDUs that go along the sides of the racks as many of us are used to.

The I/O expansion drawer connects to the nodes using two PCI slots from the node via an optical cable. For each drawer you attach, you gain 12 slots, but you effectively “lose” two slots on the system node, for a net gain of 10 slots for each I/O drawer. For this first announcement we can attach up to two I/O drawers per node, with a total of four per system in 2014. If you attach two drawers to a two-node system, this will give you 56 total I/O slots. The SoD states that IBM plans to support up to four I/O drawers per node, which would take us to eight I/O drawers or 96 I/O slots on a two-node system. For this 2014 announcement you can have either zero or two drawers per node. There is no option to just do three drawers, for example, at this time. IBM issued an SoD for the I/O drawers to connect to the S models, but that will not be available until next year, and new firmware will be required to take advantage of I/O drawers.

Using physical I/O you can run: 
AIX
AIX 7.1 TL3 SP4 with APAR IV63332 or later
AIX 7.1 TL2 SP6 or later (Jan 2015)
AIX 6.1 TL9 SP4 and APAR IV63331 or later
AIX 6.1 TL8 SP6 or later (Jan 2015)

With VIOS you can run:
AIX 7.1 TL2 SP1 or later
AIX 7.1 TL3 SP1 or later
AIX 6.1 TL8 SP1 or later
AIX 6.1 TL9 SP1 or later

IBM i
IBM i 7.2 TR1 or later
IBM i 7.1 TR9 or later

Linux
RHEL 6.5 or later
SUSE 11 SP3 and later

VIOS
VIOS 2.2.3.4 with ifix IV63331 or later
VIOS 2.2.2.6 or later (Jan 2015)
The firmware level will be 8.2.0

Other items of note in the announcement include:

  • If you want to do a model upgrade and retain the same serial number you can migrate a 770 D model to an E870, and you can upgrade a 780 D model to an E880.
  • The 5887 EXP24S I/O drawer is supported on these new machines, and if you want internal boot disks, this drawer is going to be the method you use to achieve that.
  • The PVU for the E870 and E880 will be 120, for AIX these machines will be a medium software tier and for IBM i these will be P30 machines.
  • Because these servers will pack a great deal of compute capability in a small footprint, you can definitely hear the fans, especially when they speed up to handle additional load. You may want to consider acoustic doors in your racks.
  • A new S824L model is planned to GA on Oct. 31. It is designed for high-performance analytics, big data and Java application workloads. It will incorporate NVIDIA El Capitan K40 GPU adapters and will run Ubuntu 14.10 exclusively. Virtualization will not be available for this machine.
  • There will be 2x memory available for the S824; you will be able to get 2 TB into the machine with 128G DIMMs via an RPQ, but mixing of DIMM sizes on the machine isn’t allowed.
  • The S822L and S822 are NEBS Level-3 and ETSI certified for use by clients that require a hardened infrastructure; they are designed for “extreme shock, vibration, and thermal conditions which exceed normal data center design standards.”
  • An RPQ is available to allow 900W 100-120V power supply options for four-core or six-core S814 rack-mounted servers.

These are just some of the highlights from the announcements. I have been to a few training sessions so far, and there is even more information that I was not able to cover here, but I wanted to give you a flavor of what is coming in the near future. You can read the IBM news release here.

Power Systems, Linux on Power Events Scheduled

Edit: I love this type of training.

Originally posted September 30, 2014 on AIXchange

I recently attended a no-charge Linux on Power workshop that’s currently touring the U.S. In conjunction with this 1-day workshop, a 2-day Power Systems virtualization class is traveling the country as well. Both events are geared toward IBM customers, so see your local IBM rep or business partner representative to be nominated to attend.

IBM has also said that if there’s sufficient demand (12-20 participants), the company will attempt to add events in cities that aren’t currently on the workshop schedule. Email me and I’ll get you in touch with the workshop coordinators.

Here’s the current schedule:

St. Louis
Power virtualization class: Sept. 29-30
LoP workshop: Oct. 1

Coral Gables, Fla. (Miami)
Power virtualization: Oct. 21-22
LoP: Oct. 23

Costa Mesa, Calif. (southern California)
Power virtualization: Oct. 28-29
LoP: Oct. 30

Bethesda, Md. (Washington, D.C.)
Power virtualization: Nov. 4-5
LoP: Nov. 6

Jacksonville, Fla.
Power virtualization: Nov. 11-12
LoP: Nov. 13

Malvern, Pa. (Philadelphia)
LoP: Nov. 13

Schaumburg, Ill. (Chicago)
Power virtualization: Nov. 18-19
LoP: Nov. 20

These details are provided by IBM:

 Linux on Power objectives

• Provide a Linux on POWER experience

• Maximum hands-on engagement

• You perform activity AS we present

• The Lecture IS the lab

• Build confidence in the ability to deploy Linux on POWER

• Convey IBM’s renewed commitment to Linux on POWER

• This is not a PowerVM class

• We can speak to any PowerVM questions you have

• PowerVM is not the core of the content

• Your class LPARs and their virtualization are already configured

Linux on Power Agenda

• Lab Introduction

• ISO Media install of Red Hat 6.5 on LPAR

• Linux on POWER Trends and Directions

• Network install of Red Hat 6.5 on LPAR

• Filesystems and LVM

• Graphical Desktop

• Commands and Additional Information

Power Systems Virtualization Workshop Highlights:

Overview of POWER Architecture, Power Systems servers, and virtualization concepts

Management Appliance architecture and functions

Hardware Management Console (HMC)

Flex System Manager (FSM)

Virtual Machine creation

Partitioning Configuration and requirements

PowerVM Enterprise Edition virtualization concepts

Micro-Partitioning

LPARs / Virtual Machines

Memory virtualization (AME / AMS)

Shared Storage Pools

PowerVM Live Partition Mobility (LPM / Migration)

Introduction to PowerVC

Introduction to SmartCloud Entry for Power

From my first-hand experience, I can certainly recommend the Linux on Power workshop I attended. I believe both events are a good opportunity to gain more skills and bring back hands-on experience and knowledge to your organization.

A New Look for the HMC

Edit: The interface keeps changing for the better.

Originally posted September 23, 2014 on AIXchange

I decided to update to the latest (as of this writing) HMC code, V8R8.1.0M1.

I went to IBM Fix Central and found MH01420, and downloaded the necessary .iso image.

Once I had done this, I followed the second half of this document. When I clicked on “update HMC,” I just pointed it to the server I’d downloaded the .iso image to.  It very quickly did the updates, then rebooted.

A normal HMC reboot is generally fairly fast. With an upgrade, however, it takes much longer, so be patient. Then, after the reboot, it also takes a while to actually start the console. Once it did start, I immediately realized the console had a different look and feel.
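
If you prefer the HMC command line, lshmc -V will confirm the installed version, release and service pack before and after the update. The update itself can also be driven from the CLI by pulling the image from a remote server; the updhmc invocation below is a sketch based on how I understand the syntax, with made-up host, user and file values, so check the man page on your HMC level before using it:

# Show the current HMC version information
lshmc -V
# Install an update image from a remote server, then restart the HMC (illustrative values)
updhmc -t s -h fixserver.example.com -u hmcuser --passwd mypassword -f /images/MH01420.iso -r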


Clicking on the “learn more” link at the bottom of the login screen opens a help window that displays the differences between the “classic” and “enhanced” login options. Learn more about this here:

“Learn about the differences between the Classic and Enhanced graphical user interface (GUI) in the Hardware Management Console (HMC).

Select which software interface to use when you log in to the HMC. The Classic interface provides access to all traditional functions of the HMC and the Enhanced interface provides both redesigned and new virtualization tasks and functions.

The Classic GUI is available by default on the HMC Version 8.1.0, or earlier.

The Classic GUI is available on the HMC Version 8.1.0.1, or later by choosing the Classic option while logging into the HMC.

The Enhanced GUI is available on the HMC Version 8.1.0.1, or later by choosing the Enhanced option while logging into the HMC.”

Be sure to check out the table from the link to see the differences between the Classic and Enhanced GUI tasks.

Initially, things didn’t appear all that different with the Enhanced GUI, but once I selected the server I wanted to manage, I had some different menu options.

Once I expanded them all, this was what I saw.

The rest of the menu did not fit in the screenshot. The following menu items are what you see beneath the Capacity on Demand menu item, and the menu continues along the right side of the panel.

I immediately tried the performance option, and was pleasantly surprised with what I saw.

I also took a look at Manage PowerVM, and saw this.

The goal here is to provide the capability to do all of your management from the HMC, with no need to log in to the VIO server.
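
Even at the command line, the HMC has long been able to pass commands through to the VIO server without an interactive login by using viosvrcmd, which gives you a feel for the same idea. A minimal sketch, with a made-up managed system name and VIOS partition name:

# Run a VIOS restricted-shell command from the HMC, with no login to the VIOS itself
viosvrcmd -m Server-9119-MHE-SN1234567 -p vios1 -c "lsmap -all -net"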

At the bottom of the Manage PowerVM task there was a “learn more” button. I clicked it and saw this.

When I clicked on Create Partition from Template I saw:

When you click on a partition, you now get a new set of menus. 

The manage menu gives you a new way to look at your profiles.
 

Click on advanced and you’ll see this.

I’ve just described getting started; I’ll do more testing and exploring with this new interface. If you have something specific you’d like me to try out or write about, please let me know in the comments.

Power and AIX News Via Twitter

Edit: I am still active on Twitter. Some links no longer work.

Originally posted September 17, 2014 on AIXchange

While I remain active on Twitter, it’s been a while since I’ve highlighted tweets on this blog. For this week, though, here’s a sampling of fairly recent tweets that caught my interest:

 * Torbjörn Appehl (@tappehl) — Some nice news regarding direct attached external storage for #ibmi on #powersystems 

* Mike Krafick (@MKrafick) — Quick Tip: Monitoring memory on a #AIX server. So easy, even a #DB2 DBA can do it! 

* Jyoti Dodhia (@JyotiDodhia) — TIP: #VIOS networking tips and techniques by @GlennRobinsonVS

* Brian Smith (@brian_smi) — Using #AIX’s built in Performance Recording 

* Nigel Griffiths (@mr_nmon) — New #SharedStoragePool video: Experiments in Repository disk destruction shows SSP carries on & its easy to remake.

* Jyoti Dodhia (@JyotiDodhia) –HOT >> Replay: #Linux on Power for #AIX / #IBM i guys – Doing it Easy Way with @mr_nmon

* COMMON A Users Group (@COMMONug) — FREE webcast “IBM Power Announcements – #ibmi and Power” by @IBMiSight, @Steve_Will_IBMi, Mark Olson: Oct 6 @ 9am CT 

* Gareth Coates (@power_gaz)– What’s in #IBMPowerSystems #HMC V8R810 SP1? See [here]. Classic and Enhanced GUI, System and LPAR Templates etc. I like it!

* chmod666.org (@chmod666) — New post .. finally : Exploit the full potential of #PowerVC by using Shared Storage Pools & Linked Clones

* Site Ox (@siteox) — Linux on Power8 is available NOW at Site Ox. Free 2 Weeks! Site Ox is the official provider of Linux on Power for IBM. 

As I’ve often said, Twitter has much to offer all of us who work with IBM Power Systems and AIX. So who are you following on Twitter? How active are you? And for those of you who don’t use Twitter, how do you keep current with the goings-on in the world of Power and AIX?