Project Monocle Will Simplify Patching

Edit: Some links no longer work.

Originally posted July 2017 by IBM Systems Magazine

How do you go about determining what fixes you need for your system to remain up to date?

Do you use FLRT or FLRT LITE?

Do you just log into IBM Fix Central and start looking at what is available?

What if you had a dashboard that you could log into that showed you your system names across your whole environment, along with the current firmware level and the AIX or VIOS version each one is running? What if it also showed you the recommended versions to upgrade to? What if you could also see the machine type and serial number, along with the IP address of your LPAR or frame? Would you be interested in getting a tool like this running in your environment?

What if it had a dashboard that gives an overview of which systems need to be updated and which are up to date? And what if that tool allowed you to create and share plans with other stakeholders in your organization to help with change management planning? Moreover, what if it allowed you to filter on OS, firmware, or VIOS? What if you could choose the types of machines, or the current levels, you wanted to drill down on?

I was recently given access to a demo version of Project Monocle, a tool that provides all of the functionality I described above, and I have to say I am very impressed with it. I look forward to getting it up and running so that I can do further testing in my environment. Right now, the tool is available at no charge as a technology preview, so I would suggest reaching out to the team ASAP so you can try it out for yourself. In order to get tools like this created and IBM resources assigned to work on them, users need to let IBM know what improvements they want to see. This is an example of a tool that can help simplify the lives of Power Systems administrators.

There were some interesting blog posts written about the creation of the project, including these:

http://www.jaredcrane.com/ibm-project/

http://www.stefanieowens.com/project-monocle/

“This is the best story of design at IBM in the last three years… The team came to them saying they need one-click firmware updates for Power Systems… By doing field research, they found out that this was a human problem, not a system problem. Not only did research inform what they built, but what they built was beautiful itself as well.” — Phil Gilbert, General Manager of Design at IBM

The blog posting continues with: “The Monocle team was tasked to explore the field of updates and upgrades that are mission critical to keeping all of IBM Power Systems server products secure. Think: these servers are those that run major data centers around the world that are the backbones of credit card companies, major retailers, and even governments in some instances. The sponsoring product team originally came to us asking for a ‘one-click update’ for all Power Systems. Through user research, we discovered that a one-click update was actually not the right way to go.

“We found that the current process of updating servers is rigorous and time intensive. It causes headaches for enterprises to not only find the appropriate fixes their servers need, but then it’s even worse to actually schedule downtime on the servers to fix the issues and report those repairs for security compliance purposes. It’s next to impossible to automate this process; in fact, automating it could even make matters worse! Even after that process is completed, an enterprise still has to report on its security patching in order to maintain compliance with industry regulations. As one of our Sponsor Users identified during an interview, planning and managing security patches and updates is not only a pain to perform, but also to report on.

“Not only does the security patching process take a lot of effort, but it is vitally important to the safety and security of the enterprise data; one mistake here could take an entire organization down.

Project Monocle is also described here, along with some screenshots and more details:

https://www.ibm.com/developerworks/community/wikis/home?lang=en_us#!/wiki/Power Systems/page/Monocle Patch Management

“We recognize that problem in IBM and set out to make your life easier. You may have heard of Project Monocle, but if not, it is a zero installation web application technology preview, providing a consolidated view of your inventory with the ability to drill down and view patch compliance. It actually works off existing IBM technologies such as the Technical Support Appliance (TSA), Fix Level Recommendation Tool (FLRT), and Fix Central. If you do not currently have TSA, contact the development team at bmonocle@us.ibm.com to get started. If you do have TSA and are interested in using Monocle, send an email to bmonocle@us.ibm.com and ask to get connected to Monocle.

“Finding the patch you need can be a daunting task. Project Monocle uses Fix Central and FLRT to provide you with recommended levels for each system type: AIX, IBM i, Firmware, VIOS, and HMC. Compare recommended and latest versions to see which is right for your environment. You can even see all of the APARs that are part of each update/upgrade to see how they’ll affect your systems. Need to build a report for your internal review teams? No problem, that data is all there at your fingertips…

There is more information about the technology preview that is available here:

“Project Monocle is a zero installation web application, providing a consolidated view of your inventory with the ability to drill down and view patch compliance.

“This Technology Preview provides the opportunity to use Monocle at no charge. All customers who have, or are eligible to get, the Technical Support Appliance (TSA), can gain access.

If you do not currently have TSA, please contact the development team at bmonocle@us.ibm.com to get started.

If you are not familiar with TSA, you can get more information and watch informative videos here

“Benefits: IBM Technical Support Appliance (TSA) helps you:

-Streamline IT inventory management by intelligently discovering inventory and support-coverage information for IBM and non-IBM equipment

-Improve technical support management with analytics-based reports and collaborative services

-Mitigate costly IT outages via operating system and firmware recommendations for selected platforms

“How it works: -Configure TSA to discover basic support-related information such as hardware inventory, code levels, virtual machines, and OS information from designated devices.

-Inventory information is shared with IBM TSS using security-rich transmission protocols.

-IBM uses advanced analytics and worldwide support knowledge to help identify code currency and support contract vulnerabilities.

-Continuously collaborate with your IBM TSS focal point.”

Imagine your workflow in this new environment. TSA is gathering information about your machines from your HMC, and it is sending that information to IBM. You are then able to log in to Project Monocle and see almost real-time information about your systems and how current your environment is as far as patching is concerned.

Imagine you are now able to set up upgrade plans, share those plans with others, let them approve or deny them, and have an audit trail of the plan and subsequent decisions that is available for review by interested parties.

Patching is a critical component of maintaining your systems, and this is a tool that can simplify the data gathering and decision making. You will be able to tell in an instant which systems need to be patched and what level they should be running, all from one screen.

Additional Resources:

TSA solution overview http://www.ibm.com/services/us/en/it-services/technical-support-services/technical-support-appliance/

Download the TSA image file and setup guide: https://www.ibm.com/support/fixcentral/swg/selectFixes?parent=ibm~Other%2Bsoftware&product=ibm/Other+software/Technical+Support+Appliance&release=All&platform=All&function=all

The Great Debate: AIX Versus Linux

Edit: I talk about this all the time.

Originally posted May 2017 by IBM Systems Magazine

In computing circles that I’m involved with, the debate rages on: AIX versus Linux. Administrators wonder, “Why would anyone want to keep running an OS that’s supported by one single vendor? Why wouldn’t you want to move everything to the shiniest and newest operating system and get away from ‘legacy’ enterprise computing?”

The odds are high that most AIX administrators have used both AIX and Linux and are well versed in both. Those who have a background in both OSs are better able to have informed discussions about the pros and cons of each environment than someone who has never used AIX, though that certainly doesn’t stop the latter from having an opinion.

Similar debates have raged on for years within the mainframe community. I’ve lost track of the number of times someone has declared the mainframe to be dead every time a new technology comes along. But when you look at the volume of critical transactions that still happen on the mainframe, it’s hard to believe that it’s going away any time soon.

AIX Advantages

The AIX debate is a little bit trickier. In many cases, it’s easy to port away from AIX and run on Linux or Windows. I may be a dying breed, but I still think that AIX is the premier UNIX flavor that is available today, chiefly because the hardware and OS have been coupled together to provide enterprise level reliability, availability and serviceability.

This isn’t an OS that’s running in a legacy or maintenance mode. The latest version of AIX, 7.2, shipped TL0 in December 2015 and TL1 in November 2016. One of the highlights of TL1 is the ability to install service packs and technology levels without rebooting. The platform prioritizes high levels of uptime for critical workloads, and is well-suited for environments where downtime costs real money and reliability is a must.

For example, instead of bolting on a software-based hypervisor, POWER systems natively have hypervisors built into the hardware. By using VIOS and AIX together with the POWER hardware, you have an integrated stack that comes from one company. If something goes wrong, it’s much easier to get help from that single vendor. I’m not opposed to running Linux workloads; I just think that AIX is a more mature and robust OS. If given the opportunity to run Linux, I would consider POWER as a candidate to run my Linux workloads.

Breaking Down the Differences

It was interesting to replay the presentation that Andrew Wojnarek made to the Philadelphia Linux User Group on April 11. It’s nice to see that I’m not the only one who thinks that there are real advantages to the AIX environment.

Wojnarek supports a large fleet of machines—roughly half AIX and half Linux—and he says he has a pretty good feel for what it is like to administer both environments. He goes through the basics of AIX and why you would run it. Some of his arguments in favor of AIX include things like standardization—i.e., you can run the same OS on small servers and huge enterprise servers. Compare that to the subtle differences you will find between Red Hat, SUSE, Debian, Ubuntu, etc.

He reminds us that when we are working with AIX we are in a ‘walled garden.’ He points out that there’s a standard way of doing things with standard tools and commands. He talks about the built-in Logical Volume Manager, and the ability that JFS2 gives us to both increase and decrease the size of filesystems while they are online, something that can be problematic on Linux depending on the type of filesystem you are running. He also talks about mksysb, the built-in tool for making backups that can be used to restore your server, either to the same hardware you took the backup from or to other hardware in your environment.
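
As a quick illustration of that JFS2 flexibility, both operations are one-liners while the filesystem stays mounted (the /data filesystem and the sizes here are just placeholders):

chfs -a size=+2G /data    # grow /data by 2 GB online
chfs -a size=-1G /data    # shrink /data by 1 GB online (JFS2 only)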

Device handling is a breeze on AIX. In Linux you have to echo values and edit files, whereas in AIX you just chdev a device. To discover something new, you run cfgmgr. To list attributes you run lsattr. Things are just easier and more consistent.
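
For anyone who hasn’t worked on AIX, here is roughly what that looks like in practice (hdisk1 and the queue_depth value are hypothetical):

cfgmgr                                  # discover and configure newly added devices
lsattr -El hdisk1                       # list the current attributes of hdisk1
chdev -l hdisk1 -a queue_depth=32 -P    # change an attribute; -P defers the change until the device is next configured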

Wojnarek’s presentation isn’t an AIX love fest, however. He does discuss what he dislikes about the OS, and there’s a good discussion toward the end with the user group members. I recommend you watch the replay.

Some other advantages of AIX that weren’t in the presentation include the ability to use alt_disk_copy and alt_disk_upgrade to keep online copies of your rootvg and even apply upgrades to that copy of your running OS, which you can activate the next time you reboot. If you run into problems, you just reboot from the original set of disks.
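
A minimal sketch of that, assuming a free disk named hdisk1 is available for the copy:

alt_disk_copy -d hdisk1    # clone the running rootvg onto hdisk1 as altinst_rootvg
bootlist -m normal -o      # confirm which disk the LPAR will boot from at the next restart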

Moreover, AIX has the advantage of having IBM PowerHA high availability software integrated into the OS at the kernel level and mainframe-heritage virtualization baked into the hardware, not bolted on as an add-on hypervisor. AIX on enterprise hardware has built-in error reporting and diagnostics, and when call home is enabled, we might find an IBM CE dispatched to fix a problem before we even know anything about it.

Consider Your Needs

Instead of all the arguing about which OS is better, sometimes it is worth stepping back and thinking about who is using it and why. Why do they want uptime and reliability? Why is it worth paying for hardware and software, compared to getting commodity hardware and a virtualization solution?

I’ve heard some great analogies over the years, including this one: Both a kayak and a container ship are seafaring vessels. One is better suited for taking large amounts of cargo across long distances. The smaller solution might get the job done, but you want to find and use the method that is suitable to the job at hand. Nobody would balk at spending more money on a container ship if that was the best solution. The same should hold true in the computer room.

Of course, there are some disadvantages with AIX. Perhaps you want to run the same flavor of Linux on your desktop and server—you can’t do that with AIX. Or maybe you want to learn AIX, but you don’t have access to education or hardware. The IBM Academic Initiative helps fill the education void, but access to hardware is a legitimate barrier for those who want to learn more about the platform.

It can seem harder for someone to learn ksh if all they ever knew was a Windows or MacOS GUI and bash on Linux. There’s a learning curve with AIX, but that’s true of any OS—it takes time to become proficient.

I know the world loves Linux, but there are still many of us out here who love AIX. Linux users would be well-served to objectively listen to the key points in this never-ending debate to see if the advantages that AIX users take for granted might benefit their environments.

How to Download Fixes

Edit: Still a good post.

Originally posted April 2017 by IBM Systems Magazine

I still find customers that are unsure of how to download fixes, so I want to cover the steps that I use when I download fixes for AIX 7.2 as an example use case.

When I download fixes from IBM, I go to IBM Fix Central.

As it states on the website, “Fix Central provides fixes and updates for your system’s software, hardware, and operating system.”

You can either find or select a product from that initial landing page. In my example case, I am going to find a product. I search for AIX, and I select version 7.2 fixpacks, as you can see in Figure 1 below.

Figure 1

After clicking on continue, I decide I want to get the level 7200-00-03-1642, so I select that Service Pack, as in Figure 2 below.

Figure 2

I then select continue in order to proceed.

IBM has restricted operating system fixes to machines that have current maintenance agreements with IBM, so I need to enter the machine type and serial number of the machine that I am going to install the fixes on. (See Figure 3 below; click to view larger.)

Figure 3

After putting in the correct information and selecting continue on that page, the website comes back and has me agree to terms and conditions before I am able to download the files. (See Figure 4 below.)

Figure 4

After clicking on the ‘I agree’ option, I can go ahead and download the files using Download Director; if I choose this option, they will be downloaded directly to my workstation. However, there may be a need for me to download the fixes directly to a machine in my computer room. This option assumes that the machine in question has internet access.

It can be a tedious process to download many gigabytes of files to my laptop, then turn around and move those same files to a machine in the computer room, especially if the option exists to perform this operation in one step. This is especially so in an environment where I may have relatively fast download speeds, but my upload speeds into my computer room are constricted. Some admins may find that this is the case when they are working from home, or when their office WAN connection is not very fast. Instead of spending all of that time moving files around, many times I prefer to create disk space on a server, and download the fixes directly to that server. This is very useful if the computer room has a fast internet connection. (See Figure 5 below.)

Figure 5

On the far right side of the screen I am able to find a section where I can change my download options. My options consist of using Download Director, using bulk FTPS, using HTTPS in my browser, or ordering the fixes on physical media and having IBM ship them to me so that I can load them into a DVD drive (or other optical device) and use them that way. (See Figure 6 below.)

Figure 6

Ordering the fixes on media from IBM can be a good choice: Having the media on hand can make it pretty easy to find a particular level of AIX or AIX fixes over time. This of course assumes you have a good system for tracking your physical media. It can also be a good choice if you have limitations on your internet speed or connectivity, and can be useful in a disaster recovery scenario or other recovery situation that might involve bootable media. The downside is that IBM now charges customers for this option, and there is a delay while IBM ships your fixes, so many customers choose to download them instead.

Download Director and HTTPS in my browser will both save the files to my local workstation, but in this scenario that is not the way I want to obtain the fixes. In the past, we were able to select bulk FTP as a download option, but a relatively recent change requires us to use bulk FTPS instead.

After selecting bulk FTPS, I get my order number, the number of files, the total size I am going to be downloading, along with the name of my FTPS server and the user ID and password I should use for my download. There are also FTPS hints. There is a statement that informs us that on AIX clients we should use ‘ftp -s’ to start the FTPS session, and then enter passive mode immediately. Their example has us run the commands

ftps> passive
ftps> binary
ftps> mget *

This should be familiar to you if you used bulk FTP as a download option in the past. (See Figure 7 below; click to view larger.)

Figure 7
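
For anyone doing this for the first time, a full session looks something like the sketch below. The server name, user ID, password and order directory all come from the order page, so everything here is a placeholder:

ftp -s <ftps server name from the order page>
# log in with the user ID and password shown on the order page, then:
ftps> passive
ftps> binary
ftps> lcd /export/fixes/aix72            # a local staging directory with enough free space
ftps> cd <order directory from the order page>
ftps> mget *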

So why was there a change by IBM to FTPS? This FAQ can help provide some answers. In short, the change allows for encrypted communication and secure bulk FTP download.

I am sure many of my readers are well aware of these options for downloading fixes; however, I still find customers who were not aware of the change from FTP to FTPS. It is not a huge modification, but it is just enough of a change that we need to remember to do things a little bit differently when we plan to obtain the next set of fixes for our next maintenance window.

Tools for Documentation

Edit: Some links no longer work.

Originally posted August 2015 by IBM Systems Magazine

Back in 2012, I wrote a blog post titled “The Case for Documentation.” Then just recently, a reader made a comment:

“I see in 3 years there has not been a single comment on this article. I’ve been so deep into Power and AIX, like you mentioned, walking around with knowing all there is to know in and around the environment I look after. Finally there has been an official request submitted to document PowerHA clustered environments and other smaller ones. I am so much in the thick of it that I start off and end up with a too technical visio drawing or veer off track in explaining an area. Have you got a guideline or template to give me an idea how I can start and get to finish a fairly sort and sweet “walk through” document that is just informative enough to satisfy those at management level or even my specialized level?”

Let me address this with some of my favorite tools to document Power Systems. Some of these tools have “prettier” output than others, but I think they’re all valuable when it comes to documenting your running systems.

PowerHA Tools

The original question was PowerHA specific, so let’s start with the PowerHA snapshot tool. The cluster snapshot tool lets you save and restore cluster configurations by saving to a file a record of all the data that defines a particular cluster configuration. You can then recreate that cluster configuration, provided the cluster is configured with the requisite hardware and software to support it. The snapshot tool can also make remote problem determination easier because the snapshots are simple ASCII files that can be sent via e-mail.
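
On recent PowerHA releases you can drive this from clmgr; older releases use the clsnapshot utility or the smitty menus instead. A minimal sketch with a made-up snapshot name:

clmgr add snapshot before_sp_upgrade DESCRIPTION="config before service pack upgrade"
clmgr query snapshot                     # list the snapshots currently on the node

The resulting snapshot files are what you would attach to an e-mail for remote problem determination.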

You can also use the PowerHA-specific qha and qcaa scripts. These are real-time monitoring tools that you’ll use against running systems more than as a deliverable document, but they’re still valuable. Alex Abderrazag has provided a nice script to help you understand cluster manager internal states.

HMC Scanner

When it comes to documenting the way my servers have been configured, I like to use HMC Scanner. HMC Scanner gives you a nice summary spreadsheet with almost anything you want to know about your environment, including serial numbers, how much memory and CPU are free on your frame, how each LPAR is configured, information on VLANS and WWNs, and much more. I did a video on running HMC Scanner and IBM’s Nigel Griffiths has also posted a video on HMC Scanner for Power Systems. HMC Scanner works for AIX, IBM i, Power Linux and VIOS LPAR/VM.
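
If you haven’t run it, the invocation is a single script that connects to the HMC over SSH; something along these lines, where the HMC name, user and password are placeholders (the exact script name and flags can vary by version, so check the README that ships with it):

./hmcScanner.sh myhmc hscroot -p mypassword    # scan every server managed by that HMC and build the spreadsheet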

System Planning Tool

I also like to use the IBM System Planning Tool (SPT), which I blogged about in “Configuring Your Machine Before it Arrives” and which you can find on the IBM support tools website. The SPT provides nice pictures of the machines showing which slots are populated and assigned to which LPARs.

If you’re comfortable with the command line, you can manipulate sysplans with the following commands, which may be easier than going into the GUI to do the same functions:

lssysplan
rmsysplan
mksysplan
cpsysplan
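
For example, capturing the current configuration of a managed system into a new sysplan from the HMC command line looks roughly like this (the file and managed system names are made up):

mksysplan -f current_config.sysplan -m Server-8286-42A-SN0000000    # create a sysplan from the running configuration
lssysplan                                                           # list the sysplan files stored on the HMC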

viosbr

For VIO server-specific documentation, I like to use viosbr. After you’ve taken a backup, run:

viosbr -view -file

This provides a lot of information to document the setup of your VIO server. It will show your controllers, physical volumes, optical devices, tape devices, Ethernet interfaces, IP addresses, hostnames, storage pools, optical repository information, EtherChannel adapters, shared Ethernet adapters, and more.
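
A quick sketch of the round trip, run as padmin (the file name is arbitrary):

viosbr -backup -file vios1_config        # save the current virtual and logical configuration
viosbr -view -file vios1_config.tar.gz   # later, display everything that backup captured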

snap -e

AIX-specific commands would include snap -e, which lets you gather a great deal of system information and run custom scripts to include other information with your snap. This tool is often run in conjunction with support to collect the information they need to help resolve issues with your machine.

prtconf

Another worthwhile command is prtconf. This command gives you information like model number, serial number, processor mode, firmware levels, clock speed, network information, volume group information, installed hardware, and more.
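
A couple of handy variations when you only need one piece of that report:

prtconf | more    # the full report, a page at a time
prtconf -s        # just the processor clock speed
prtconf -m        # just the memory size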

IBM i Options

For IBM i, the midrange wiki has good information about different methods you can use to gather data, including how to print a rack config from a non-LPAR system:

  1. Sign on to IBM i with an appropriate userid
  2. On a command line, perform command STRSST
  3. Select option 1, Start a service tool
  4. Select option 7, Hardware service manager
  5. F6 to Print Configuration
  6. Take the defaults on Print Format Options (use 132 columns)

HMC

In the new HMC GUI, you can select your managed server, then Manage PowerVM, and you have options to see your virtual networks, virtual storage, virtualized I/O, and more. This information can also be helpful in documenting your environment.

Self-Documenting Tools

I find there’s value in having systems that can “self-document” via scripts and tools compared to administrators creating spreadsheets that might or might not get regular updates as soon as changes occur. Some might find self-documenting tools don’t provide the correct information, which leaves us with the question of whether it’s better to have no documentation or wrong documentation when you’re working on a system.

Self-documenting tools are a starting point. Whatever documentation you have on hand, take the time to double-check what the actual running system looks like compared to what you think it looks like. By not assuming anything about your running systems, you can avoid creating additional problems and outages because reality didn’t match what the documentation said.

Many Different Documentation Tools

From the frame, to the OS, to the VIOS, to the HMC, there are many different pieces of your infrastructure to keep an eye on and many different tools you can use to document your environment. I’m sure readers use many other tools and I’d be interested in hearing about those. Please weigh in with a comment.

To VIOS or Not to VIOS Revisited

Edit: Still worth considering.

Originally posted September 2014 by IBM Systems Magazine

In 2010, I wrote an article that covered the pros and cons of the virtual I/O server (VIOS). It’s still a topic that I run into today, especially as more IBM i customers consider attaching to SANs. In the article, I mentioned some of the concerns customers have, including their VIO server being a single point of failure, and the new skills that are required to administer the VIO server.

I want to reinforce the idea that you can build in redundancy when you design your VIO servers to reduce single points of failure. Some customers like to have dual VIO servers on each physical frame, but you can take it further than that. You can have one set of VIO servers to handle your storage I/O, and another pair to handle your network I/O. Some customers go one step further and segregate their production LPARs onto production VIO servers, and put their test/dev LPARs onto another set of VIO servers.

More Flexibility

You have a great deal of flexibility in how you configure and set up your Power Systems servers depending on the needs of your business.

IBM has made great strides in the usability of VIOS, especially for those uncomfortable with the command line. If you truly don’t want to log in as padmin and do your work from the shell, the Hardware Management Console (HMC) GUI gets better with each new release.

When you click on the Virtual Resources section of the HMC, you have access to Virtual Storage Management, Virtual Network Management and Reserved Storage Device Pool Management. Although these options have been around for a while, some don’t realize they exist or that ongoing improvements are being made to the interface and the choices that are available.

These options continue to become more powerful. For example, when I go into Virtual Network management, I can create a VSwitch, Modify a VSwitch, Sync a VSwitch and Set a VSwitch mode. I can view my existing VLANs and my shared Ethernet adapters.

Similarly, I can manage my storage through the Virtual Storage Management GUI. Modifying which hdisks are assigned to which LPAR and modifying virtual optical disk assignments to partitions can all be handled via the GUI.

I still prefer to use the VIO command line, and I still encourage you to learn how to do it as I think you have more power and control over the system using that method, but it’s becoming less mandatory to work as padmin than it used to be.
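
To give a flavor of what that padmin work looks like, here is a minimal vSCSI mapping example (the disk and vhost adapter names are placeholders):

lsmap -all                              # show the existing vSCSI mappings on this VIO server
mkvdev -vdev hdisk5 -vadapter vhost0    # present hdisk5 to the client LPAR behind vhost0
lsmap -vadapter vhost0                  # confirm the new virtual target device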

Easier Installation

Another powerful new tool is the capability to use the HMC GUI to actually install VIOS. Instead of fooling around with physical media or setting up your NIM server to allow you to load your VIOS, you can now manage a VIOS Image Repository on your HMC, where you store the VIO optical images on the hard drive of your HMC. I was pleasantly surprised when I was shipped a 7042-CR8 HMC with the HMC V8.8.1 code on it: the VIO install media was preinstalled on the HMC hard disk.

Loading that first VIO partition onto a new system was a snap. Once I got everything properly configured on the network and defined my VIO partition via the HMC, I was able to easily load multiple VIOS LPARs by clicking on the Install VIOS radio button and filling in a few network parameters in the GUI.

This is quite a change for people who are new to Power Systems servers, or those who don’t have NIM servers or don’t know how to use NIM servers. IBM i shops may never have a NIM server in their environments so that option isn’t even available for them.

When customers purchase some of the smaller Power Systems servers and opt to get a split backplane, it can be a challenge to get their second VIO server loaded as they can’t connect their DVD to their second disk controller. Allowing for installation from the HMC greatly simplifies the deployment of VIOS, especially in new environments. Preloading the necessary code only makes it that much easier.

More Alternatives

Another development that has arisen since I first wrote that article is the widespread adoption of NPIV, which gives admins an alternative to vSCSI. The advantage is that instead of being concerned with mapping LUNs from VIOS to client partitions, you can offload some of that heavy lifting to your SAN team. Now the SAN team is able to map LUNs directly to the client LPARs that will be using them. Some SAN teams don’t care for the extra burden. In one shop that made the change, there were nearly a hundred LPARs on a frame, and the vSCSI mappings had been handled at the VIOS level. This allowed the SAN team to map a great many LUNs to a relatively few WWNs. Once they migrated to NPIV, this burden shifted, and the SAN team was less than thrilled about it.
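
For reference, the VIOS side of an NPIV setup is small; a sketch with made-up adapter names:

lsnports                               # check which physical FC ports are NPIV capable
vfcmap -vadapter vfchost0 -fcp fcs0    # bind the client's virtual FC adapter to a physical port
lsmap -npiv -vadapter vfchost0         # show the mapping and the client WWPNs the SAN team will zone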

Comfort and Choice

The debate will continue, but the resistance seems to have lessened somewhat around the deployment of VIOS. As more shops get comfortable with the technology and more people spread the word, there is less fear around using this method to share adapters across many LPARs.

IBM continues to allow for choice in how you build your machines. I still know of customers that don’t virtualize anything and instead have dedicated CPUs and adapters for each LPAR. This type of setup is becoming rarer as companies realize the benefits of virtualizing their environments with VIOS.

IBM Delivers With POWER8

Edit: And now I wait for POWER10.

Originally posted April 2014 by IBM Systems Magazine

POWER8 technology created some buzz when it was first discussed at the Hot Chips conference, and slides describing the chips could be found online before today. But now we have more information about the actual systems that will be shipping when they become generally available in June.

When you look at the Power Processor Technology Roadmap since 2004, you can see that we regularly get new, more powerful chips. We are almost spoiled. When IBM says it is going to deliver, it does just that, with both new hardware and new OS releases.

In 2004 we had POWER5, followed by POWER5+. In 2007 we had POWER6, which led to POWER6+. In 2010 we had POWER7, which led to POWER7+. In 2014 we have POWER8, and there are already charts that show POWER9 being planned for the future. IBM has consistently delivered on its roadmaps.

I recently attended an education session for IBMers and business partners that covered information around POWER8 and the new IBM hardware announcements that are being made today. I am going to hit some of the highlights, but additional information will be included in future posts.

The POWER8 Chip

The POWER8 chip is another step up from what has come before. We have gone from four threads to eight threads per core. With simultaneous multithreading (SMT) enabled you can have up to eight threads running on a core, which means you can get more work done per CPU cycle.

The charts that I saw showed a linear increase in the number of transactions that could be completed when you compared SMT1 to SMT2 to SMT4 to SMT8. As you made each transition you could see the number of transactions increase. Obviously some workloads won’t benefit from SMT, but those will be the exception rather than the rule.

I also saw charts that compared I/O bandwidth and memory bandwidth on the new systems compared to older models, and the numbers were impressive. It was a significant increase that I will be discussing further in future articles.

While POWER7 technology had up to eight cores per socket, the POWER8 chip has up to 12 cores per socket. New memory controllers and memory cache on the system improve memory latency and performance.

The way the cores communicate with one another across the SMP interconnect has also improved, so it takes fewer “hops” to go from one core to another in the system. The chip also boasts a direct PCIe Gen3 I/O interface for incredible bandwidth.

There is 512 KB of L2 cache per core, 96 MB of shared L3 cache and up to 128 MB of off-chip L4 cache.

Understanding the Models

How comfortable are you with the model numbers of the Power servers? If someone says 720, 770 or 795, do you have a pretty good idea what server they are talking about? With today’s announcement, how many of you were expecting to see 820 and 870 server models? This is not going to be the case. The servers are now named with four- or five-character combinations of letters and numbers. For this first announcement, the servers all start with the letter “S,” which signifies that they are scale-out servers. As time goes on I would expect to see models that start with the letter “E” for enterprise systems. The second digit indicates that it is running POWER8. The third digit indicates the number of sockets in the server, and the last digit indicates how much rack space it takes up, for now either 4U or 2U.

For example, the S822 is a scale-out server, running POWER8 technology, with two sockets, fitting in 2U of rack space. The S824 is a scale-out POWER8 two socket 4U server. If you see an L in the fifth digit, like the S822L, then that is a Linux-only system, much like today’s 7R1, 7R2 or 7R4 servers.

We need to pay attention to the lettering. The L designates it will only run on Linux. The 2U non-L models can run AIX and Linux. The 4U non-L models can run AIX, IBM i and Linux. At the time of this announcement, you cannot have an I/O drawer with PCIe slots on any of these machines, although a statement of direction indicates that this capability will be available in the future.

Here are the specs for the new servers:

  • The two socket 2U servers (S822) can have different configurations depending on whether you populate both sockets. If you have one socket populated, you can have six or 10 cores, with up to 512 GB of memory. There are six PCIe Gen3 low-profile hotplug adapters in this configuration. If you have both sockets populated, you can have 12 or 20 cores, with up to 1 TB of memory. Nine PCIe Gen3 low-profile hotplug adapters are included in this configuration. You can run PowerVM with AIX or Linux, but not IBM i on this server.
  • The S822L can have 20 or 24 cores, with up to 1 TB of memory and nine PCIe Gen3 low-profile hotplug adapters. You can run PowerVM or PowerKVM and you can only run Linux on this machine.
  • The S814 is a one-socket 4U system that can come in a 4U or tower form factor. It has six or eight cores and 512 GB of memory. You can have seven PCIe Gen3 full-high hotplug adapters, and you can run PowerVM with AIX, IBM i or Linux.
  • The S824 is a two-socket 4U server. If you populate one socket you can have the same specs as the S814, but if you populate both sockets you can get 12, 16 or 24 cores with up to 1 TB of memory. You will have 11 PCIe Gen3 full-high hotplug adapters and can run PowerVM with AIX, IBM i or Linux.

Performance

The rPerf and CPW numbers that I saw showed improvements, and I will write more about this in the future as well. IBM asked us not to share the numbers until they are audited and vetted, but I will be surprised if the improvements, especially when compared with competitors’ machines, are not as dramatic as we saw during the training sessions. It was also amazing to see how these new systems perform when comparing an S824 with a POWER5+ 595 or a POWER4 690.

Another part of the story is how this improved performance translates into needing fewer cores to do the work that you need your server to do. That means you will need to spend less to buy hardware, and you will receive better performance per dollar spent.

We will be able to perform Live Partition Mobility operations between POWER6, POWER7 and POWER8 machines, assuming we’re using the correct processor mode. We can run the LPARs in POWER6 mode, POWER7 mode or POWER8 mode. This will also make it possible to run OS versions that are not POWER8 aware assuming you are using VIOS for your I/O.

Miscellaneous Information

You can run AIX in POWER8 mode with full I/O support once you get to:

  • AIX 6 TL7 SP10
  • AIX 6 TL8 SP5
  • AIX 6 TL9 SP3
  • AIX 7 TL1 SP10
  • AIX 7 TL2 SP5
  • AIX 7 TL3 SP3

POWER8 support for IBM i will be available in IBM i 7.1 TR8 and IBM i 7.2, as well. We will need to be running VIOS 2.2.3.3 for POWER8 support.

There is also a new HMC model 7042-CR8 that will be available later in the year.

I should be getting my hands on some of these models shortly and will be able to share more information once I do.

These are some of the highlights that I found interesting. What are you looking forward to the most with these new systems?

Top 10 Reasons AIX Will Endure

Edit: Still good stuff.

Originally posted June 2013 by IBM Systems Magazine

The AIX operating system continues to be a leader in the UNIX marketplace. AIX celebrated 25 years in 2011, and users have every reason to expect that the operating system will continue to evolve and move forward for the next 25.

Businesses of all flavors in all industries have varied experiences with the operating system. Some have been running it for many years—or even from its inception. Others are new to the environment as IBM continues to migrate clients from other UNIX or Windows platforms.

In most cases, people making the switch want an enterprise-class operating system running on enterprise-class hardware. They don’t want to answer their problems by rebooting the system. Businesses in all industries have critical workloads, and unexpected downtime is not an option—they need robust hardware that can let them know if problems are on the horizon.

They also want their hardware to call home to IBM if it has an issue. They want to call IBM support and get answers to all of their hardware and operating system questions. It’s not uncommon to hear stories of clients who didn’t even know they had a problem until IBM support called to let them know they’d be stopping by to replace a failing power supply and no downtime would be required.

Clients deciding what criteria they’ll use when selecting servers and operating systems shouldn’t base their decision strictly on price, where the acquisition price point wins no matter what the total cost of ownership might be. They should also think about their end game and what they’re trying to accomplish. You want a high-performing processor at the heart of your hardware platform. You want what IBM calls RAS—reliability, availability and serviceability. You should also look for the satisfaction of the platform’s end users along with those who maintain the servers.

Top 10

For these 10 reasons, AIX should still be going strong for many years to come:

1 It’s easy to use. AIX clients can use command-line tools such as smitty, which is menu-driven and can help find the tasks you’re seeking without memorizing the commands and flags on the command line. The tool keeps a history of the commands run and sends the output from those commands to a log file. It can also display the actual command it ran “under the covers.” You can go into smitty, select your options, hit F6 and it will display which command will run. This also lets you automate tasks with a script. If you prefer a GUI, you can run tools such as IBM Systems Director, which can help manage an entire fleet of servers and the virtual machines running on them.

2 It’s easy to learn more about the operating system and the hardware. A great deal of information is available in the IBM Redbooks publications, freely available documents that cover hardware and software products in great detail. Additionally, many people are writing blogs, publishing articles, recording videos and sharing knowledge with one another. In a short amount of time, you can get up to speed with the various ways to use the system. Even if you’re a longtime user, you can learn more by reading the ample and ever-increasing documentation.

3 It’s easy to get support when you need it. You can call IBM and ask how-to questions, or if you run into issues, you can easily speak to experts that can help. They take snapshots, or “snaps,” of your system to help analyze it, and they have secure shared-screen sessions available to help with troubleshooting, if necessary. Since IBM develops the processor, assembles the machine and creates the operating system, it owns the stack; therefore, the company deeply understands the system your business is running.

4 Because IBM owns the entire stack, it creates the hardware and the firmware. And since it employs the developers, it can get field questions answered by the people who actually wrote the code. You can feel confident knowing that the experts who built the hardware also built the virtualization hypervisor that runs on top of it, enabling virtualization with little overhead.

5 The ecosystem is full of friendly people willing to help you learn. Many users are willing to share their expertise, and if you want to learn more, the sources are available. Training classes and conferences offer opportunities to learn directly from experts. User groups and virtual user groups let users network and learn from one another.

6 AIX just runs. Although it’s obviously recommended that you continue to update the firmware on your server and install fixes and patches to your operating system, if you were to neglect it and let it sit in a corner of your machine room, it would happily hum along with little intervention. Over the years, you’ll find many examples of clients who followed the adage “if it isn’t broken, don’t fix it,” and they just let their systems run. Ask other AIX shops how often their production LPARs go down due to the operating system or the Power Systems hardware. The answer is likely close to never.

7 As good as the platform is, it keeps getting better. IBM consistently delivers more functionality via faster hardware and more functionality from the operating system.

8 IBM provides innovations not found elsewhere in the enterprise UNIX space. Live partition mobility, or moving a running workload with no outages from one frame to another, doesn’t happen on other platforms. Active Memory Expansion allows for compressed memory, which drives higher utilization of the memory you purchased. Active Memory Sharing allows workloads to shift memory consumption between LPARs as demands for that memory shift over time. Workload partitions (WPARs) let you run multiple workloads on a single LPAR. Simultaneous multithreading allows for more work to be performed per processor core. All of these innovations keep IBM leading other vendors by a wide margin.

9 IBM has a clear roadmap. The company has predictable cycles in releasing new processors, hardware and versions. It has consistently delivered on its technology, where others have stumbled along the way.

10 IBM makes a huge investment in R&D and chip technology. This investment shows in the products that IBM sells. The company also trickles down innovations from other product lines, for example, using mainframe technologies in its midrange servers. IBM has been learning lessons in the computing field for more than 100 years, and that knowledge gets implemented in the hardware it sells. As IBM continues to innovate and invest in the product line, clients will continue to benefit by running an enterprise-class operating system for many years to come.

Getting a Handle on Entitled Capacity and Virtual Processors

Edit: Some links no longer work.

Originally posted July 12, 2012 by IBM Systems Magazine

Entitled capacity and virtual processors aren’t new to Power Systems. They frequently come into play when you’re working with shared processor pools, and multiple virtual machines (VMs) are using that shared processor pool.

However, many people struggle with these concepts, particularly individuals who are new to the Power platform due to a migration from some other flavor of UNIX.

Physical and Virtual CPUs

First, keep in mind you can never use more physical CPUs than virtual CPUs as defined in your LPAR. Even if you allocate one virtual processor to an LPAR and set it to be uncapped, you can’t run more than one physical processor because there would be no other virtual processors available.

This gives you a way to limit LPARs in your shared processor pool: even if your LPAR is uncapped and there are 16 processors available in the pool, it still won’t be able to use more than one physical CPU, because you only allocated one virtual CPU.

A virtual processor can represent from 0.1 to 1 of a physical processor. If you have one virtual processor, the range it can physically consume will never be more than one. If you have three virtual processors, you can use from 0.3 to 3, but never more than three.

It makes sense, as you’re basically giving your VM the illusion that it’s dealing with physical processors. If it boots up and sees three virtual processors, even if it’s running on 0.3 physical processors’ worth of entitlement, it won’t see more than three processors. If it’s running uncapped and wanted to use four physical processors, where would they run if there are only three virtual processors?

Complicating the Issue

Simultaneous multithreading (SMT) can confuse the issue more. With POWER7 you can have four SMT threads, so the one virtual processor you set up will appear as four logical processors in your VM. If you were to turn off SMT, you would only see one logical processor.

When you assign physical processor resources to your VM, you’re setting up your entitled capacity. No matter what the other VMs on your frame are doing, your VM is entitled to use that much physical processor. It might donate spare cycles it’s not using, but if the VM needs those cycles, it’s guaranteed to get them.

If your VM is uncapped, it can utilize excess cycles in the shared processor pool. By doing this, you might find your entitled consumption can exceed 100 percent. You might find your VM consistently runs at 300 percent of entitled capacity. Capping the VM will limit how much physical processor it can use, and you’ll never run at more than 100 percent of the entitled capacity. This is another way to limit a VM’s processor utilization—you can cap it as well as limit how many virtual processors it has.
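
On the AIX side, lparstat is the quickest way to see all of this; a brief sketch of what to look at:

lparstat -i | grep -iE 'entitled|virtual cpus|mode'    # entitlement, virtual CPU count, capped or uncapped mode
lparstat 5 3                                           # watch %entc; values over 100 mean you are using uncapped cycles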

No Easy Way Out

It would be easy to say, “Great, I’ll just leave all of my VMs uncapped and configure them to use all of the physical CPUs available by setting all virtual processors on all VMs to the same number of physical processors that I have in my shared processor pool, and let them fight it out.” Although this might be a way to get started, this method can have drawbacks.

Remember to pay attention to your workload. Consider the additional context switching you’ll see if your VM is uncapped, but its entitlement is too low. Keep in mind the number of virtual processors you’ve defined. If you’ve defined eight but only use two, you might end up with additional overhead on your system. Although the system does have processor folding—in which it won’t schedule work onto the unused virtual processors in order to have better memory cache hits—it’s best to avoid defining excess virtual processors in the first place.

Also remember that your virtual processor count can impact your job stream. It might make sense to have four virtual CPUs and 1.6 processing units; this would give you 0.4 processing units on each virtual CPU. With a highly threaded workload, this might be optimal. However, if you have two virtual CPUs and the same 1.6 processing units, each virtual CPU gets 0.8 physical CPUs to utilize. Depending on the workload, this scenario might make more sense.

Monitor your workload and try to match up your real-life workload with the settings on the VM. If you entitled the VM to 0.5 physical CPU, but it’s consistently using two physical CPUs, try bumping up the entitlement so you know it won’t be starved for resources later. Right now, your pool might have enough capacity to allow the VM to use those two CPUs without issues, but if workload characteristics change, you could find that jobs that used to run fine are now having issues because they’re starved for resources.

Another way to manage resources in an uncapped pool is by using the weights you assign to VMs through the Hardware Management Console (HMC). This might not be as granular as you think: there won’t be much of a difference between one VM getting a 200 share and another getting a 180 share. So use some meaningful numbers: make the higher-priority VMs 250 and the lower-priority machines 50, for example. You want the numbers to actually mean something when two VMs are competing for resources and one is far more important.

Educate Yourself

For more on entitled capacity and virtual processors, I recommend the additional articles and IBM Redbooks papers listed in the Resources section of this article. Another way to learn more is by testing LPARs on a sandbox server and seeing what happens to your systems when you make dynamic changes to your physical and virtual processors on a running system. Reading about the topic is one thing, but to make sure you understand it, make changes and see if you get the results you expected.
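
If you do have that sandbox, the HMC command line is a convenient way to make those dynamic changes; a rough sketch, where the managed system and LPAR names are placeholders and the profile maximums are assumed to leave room to grow:

chhwres -r proc -m MyServer -o a -p testlpar --procunits 0.5                  # add 0.5 processing units of entitlement
chhwres -r proc -m MyServer -o a -p testlpar --procs 1                        # add one virtual processor
lshwres -r proc -m MyServer --level lpar --filter "lpar_names=testlpar"       # check the result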

An LPAR Review
https://robmcnelly.com/an-lpar-review/

Configuring Processor Resources for System p5 Shared-Processor Pool Micro-Partitions
http://www.ibmsystemsmag.com/aix/administrator/systemsmanagement/Configuring-Processor-Resources-for-System-p5-Shar/

IBM Redbooks
http://www.redbooks.ibm.com/redbooks/pdfs/sg247590.pdf

AIXpert Blog: AIX Virtual Processor Folding is Misunderstood
https://www.ibm.com/developerworks/mydeveloperworks/blogs/aixpert/entry/aix_virtual_processor_folding_in_misunderstood110?lang=en

POWER7 Virtualization Best Practice Guide
http://www.ibm.com/developerworks/wikis/download/attachments/53871915/P7_virtualization_bestpractice.doc?version=1

Reliable Restores

Edit: Some links no longer work. Originally published on IBM Systems Magazine

mksysb backups make AIX recovery easy

February 2011 | by Anthony English

If you ever need to restore your AIX system, you’ll need a reliable OS backup. You can create this via the mksysb command, which, as the name implies, makes a system backup. That’s not to say it gets your entire AIX system, but it does create a backup of the OS itself, the root volume group. For any other volume groups you’ll need to rely on other backup utilities.

You might need to build a system from a mksysb for disaster recovery or if your OS has become corrupted, but those aren’t the only times a backup comes in handy. A mksysb is a simple and effective way of migrating to new hardware. It can also be used to clone an existing AIX system. For example, you could create a Standard Operating Environment (SOE) LPAR, take a mksysb backup of it and use that to build new LPARs.

Tracing Your Roots

A mksysb is much more than a backup of the files in the rootvg file systems. It includes a boot image, optional software that has been installed into rootvg and system informational files. The mksysb contains the layout of the rootvg logical volumes and the file systems. This is important, as those file systems get created as part of the mksysb restoration process. That saves a lot of work and time. Restoring a mksysb even gives you the option of recovering your devices, so you don’t have to reconfigure network settings, disk attributes and so on. You’d normally use this only when you’re restoring onto the same system you backed up.

In the days of stand-alone systems, the mksysb command would write to a dedicated device such as a tape drive. Today it’s more common for the mksysb to be written to a file on disk and stored on a different LPAR. That way it can be made ready for use without needing to load physical media such as tapes or DVDs.

At Your Command

The mksysb command can be run from the command line or using the SMIT fastpath smitty backsys. You have to specify the output device or file. The mksysb file that’s created is typically between 2 and 4 GB, but it could be much larger, depending on the size of your rootvg. The target file system needs to have enough space for this file, and be large-file-enabled. The ulimit should be set to unlimited for the user who runs the backup.

Updating Your Image

When mksysb is run, it includes details about volume groups, logical volumes, file systems, paging space and physical volumes. These details are stored in a file called /image.data, which can be created at any time with the mkszfile command, or at the time of the mksysb by using the -i flag. This flag provides an up-to-date snapshot of the file system sizes and mount points. Figure 1 shows the output of a mksysb command that has been written to a file on the /backup file system. As the -i option was used, you can see that a new /image.data file was created.
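
Putting that together, a typical invocation looks like this (the target path is just an example):

mksysb -i /backup/aixhost1.mksysb    # refresh /image.data via mkszfile, then back up rootvg to the file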

Be Exclusive

There may be files or directories in your rootvg that you don’t want to include in the mksysb. You can use another utility to back them up, or you may not want to back them up at all. To exclude certain files from your backup, create a file called /etc/exclude.rootvg and enter the patterns of file names or directories that you don’t want to include. Figure 2 shows an example of /etc/exclude.rootvg. When using this exclude file, you need to call the mksysb command with the -e flag. The mksysb command documentation provides more details on the file format.
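
A small illustration of what that might look like; the entries are grep patterns matched against paths that begin with ./ and these particular directories are just examples:

# /etc/exclude.rootvg
^./tmp/
^./var/tmp/
^./backup/

mksysb -i -e /backup/aixhost1.mksysb    # -e tells mksysb to honor /etc/exclude.rootvg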

mksysb to DVD

It’s common to create the mksysb backup from a NIM server (see links), or with the mkdvd command, which will create the backup in a format suitable for DVDs. Many administrators are familiar with using NIM to create the mksysb, so I’ll focus on creating a mksysb backup in DVD-compatible format. This doesn’t require burning the backup onto a physical DVD. If you’re using the virtual I/O server (VIOS) Virtual Media Library, you can use mkdvd to create the mksysb file in ISO format, copy it to the VIOS and then load it onto a virtual optical device when you want to use it. This is a simple and quick way of cloning or recovering your AIX OS, and it lets you keep bootable OS backups handy without the need for physical media. For more information about the virtual media library, read “Media Release.”

When creating the mksysb in DVD format, the mkdvd command can use an existing mksysb file or create a new mksysb. If the mksysb has been created beforehand, mkdvd can point to it using the -m flag. If you create a new mksysb backup first, the mkdvd creates a new /image.data file. It also accepts the -e flag to exclude unwanted files or directories.

New File Systems

To create a new mksysb and save it in ISO file format, use mkdvd -eS. This creates some temporary file systems in rootvg, so make sure you have spare disk space. You can specify an alternate volume group for the file systems with the -V flag. The -S flag will ensure the final ISO files don’t get deleted, so you can copy them to a remote host such as the VIOS. You’ll need to clean them up on the source host after you’ve done the copying.

The final mksysb file in ISO format is put into /mkcd/cd_images and is called “cd_image_” followed by the process ID as its suffix. If multiple volumes are required, the final images have suffixes to indicate their volume number. You can see a sample output of the mkdvd command with its default file systems in Figure 3.

You can leave the mkdvd to create the file systems it needs in the default locations, or point to other directories using the flags outlined in Figure 4.

A backup is only useful if it can be relied on for successful restores. When it comes to restoring the mksysb, you may need more than the original backup to proceed. The restore process requires 1) a device to boot from, 2) the mksysb backup itself and possibly 3) the AIX product media for installing devices. Ordinarily, if you’re restoring to the same system you backed up from, the mksysb will serve as the boot device, provided the backup itself is bootable. If you’re restoring to a different system, for example for a disaster recovery test, or if the backup was done via mkdvd with the -B flag (non-bootable), you’ll need to boot off the AIX installation media. The media can be a file-backed device presented via a virtual optical device on the VIOS.

You can customize your mksysb restore to specify options such as the disks you want to restore to, and whether to recover device information such as network settings. You make these selections from the installation menus after booting the target LPAR in maintenance mode (the System Management Services, or SMS, menus are where you choose the boot device). You can also do the restoration in unattended mode by editing the /bosinst.data file before you run the backup. If the system you’re restoring to has access to a diskette drive, you can create your own /bosinst.data and point to that at the time of the restore.

How do you know if your mksysb contains all of the device drivers you need for restoration? If you’re restoring to a different server, you’ll generally want to boot from the AIX product media to get any missing device drivers installed. If you are restoring to the same hardware configuration you backed up from, but have booted from product media that is a later version than your mksysb, you’ll be asked to load the product media so that the system you restore to will have its AIX software updated. You can override these additional installations by editing the bosinst.data file and setting INSTALL_DEVICES_AND_UPDATES to no. The default is yes.
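
As a sketch of what those bosinst.data edits might look like, here is a minimal control_flow stanza covering the unattended-restore and update options discussed above (a real file has many more fields, plus the target_disk_data stanza):

control_flow:
    CONSOLE = Default
    PROMPT = no
    RECOVER_DEVICES = yes
    INSTALL_DEVICES_AND_UPDATES = no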

Ready-Made Recovery

It’s important to do regular tests of the mksysb restoration, not only to ensure the backup was successful, but also as a means of having documented procedures for rebuilding your system in the event of a disaster. It pays to be confident with your mksysb restoration procedures. Hardware failures aren’t the only reason you may need to use them. Simple mistakes can damage an OS or render it unusable, and the ability to restore AIX quickly and reliably is a key element in your system-recovery strategy.

Tape Storage: An Oldie but a Goodie

Edit: Good information.

Originally posted August 2011 by IBM Systems Magazine

As much as we like to tell ourselves that older technology is becoming obsolete, I still see fax machines, dot-matrix printers, dumb terminals, and tape drives out in the wild. Some will argue that we should be doing disk-to-disk backups and eliminate tape entirely, but when it comes to cost per GB and the ease of storage and transport, tape isn’t going away any time soon.

We recently had a customer that is new to AIX ask how it could back up all of its volume groups onto a single tape so it wouldn’t need an automatic tape changer, eliminating the need to handle more tapes than necessary.

The organization had multiple volume groups, including its rootvg, and it wasn’t using more space on disk than would fit on one tape. So the question was, “How could the customer get all of those volume groups and all of the data onto a single tape?” Basically, it needed to append all of the other volume groups onto the tape after the mksysb was done.

Search for a Solution

Looking online, I found someone else had the same question in a forum. “Is it possible to use AIX’s mksysb and savevg to create a bootable tape with the rootvg and then append all the other VGs?”

To create the backup, one astute responder suggested a script similar to this one:

tctl -f /dev/rmt0 rewind
/usr/bin/mksysb -p -v /dev/rmt0.1
/usr/bin/savevg -p -v -f /dev/rmt0.1 vg01
/usr/bin/savevg -p -v -f /dev/rmt0.1 vg02
/usr/bin/savevg -p -v -f /dev/rmt0.1 vg03
tctl -f /dev/rmt0 rewind

The script’s author stated:

  • mksysb backs up rootvg and creates a bootable tape.
  • Using “rmt0.1” prevents auto-rewind after operations.

He went on to explain the restore procedures: for rootvg, boot from the tape and follow the on-screen prompts (a normal mksysb restore). For the other volume groups:

tctl -f /dev/rmt0.1 rewind
tctl -f /dev/rmt0.1 fsf 4
restvg -f /dev/rmt0.1 hdisk[n]

“fsf 4” will place the tape at the first saved VG following the mksysb backup. Use “fsf 5” for the 2nd, “fsf 6” for the 3rd, and so on.

If restvg complains about missing disks, you can add the “-n” flag to forgo the “exact map” default behavior. If you need to recover single files, the writer suggested:

tctl -f /dev/rmt0 rewind
restore -x -d -v -s 4 -f /dev/rmt0.1 ./path/file

In addition, I recommend adding a tctl offline or rewoffl at the end of the script to eject the tape. Otherwise, you will have a bootable tape sitting in your system. And, depending on your bootlist, if the machine restarts you could boot off of the tape, or if someone forgets to swap the tapes, you will overwrite it.
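
For example, the last line of the script could become:

tctl -f /dev/rmt0 rewoffl    # rewind and eject the tape when the backups finish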

Exclusivity

If your data set nearly fits on a single tape, you can use /etc/exclude.rootvg and /etc/exclude.vg01 to exclude files and directories that you don’t need to back up. If there’s some scratch data on the system that doesn’t need to be backed up and restored, just exclude it.

This mksysb documentation tells us that the tape format includes a boot image, a bosinstall image and an empty table of contents, followed by the system backup (root volume group) image. The root volume group image is in backup-file format, starting with the data files and then any optional map files. In order to exclude files, it explains:

Use -e to exclude files listed in the /etc/exclude.rootvg file from being backed up. The rules for exclusion follow the pattern-matching rules of the grep command.

To exclude certain files from the backup, create the /etc/exclude.rootvg file, with an ASCII editor, and enter the patterns of file names to exclude in your system backup image. The patterns in this file are input to the pattern matching conventions of the grep command to determine which ones will be excluded from the backup. If you want to exclude files listed in the /etc/exclude.rootvg file, select the Exclude Files field and press the Tab key once to change the default value to yes.

  • For example, to exclude all of the contents of the directory called scratch, edit the exclude file to read:
/scratch/
  • To exclude the contents of the directory called /tmp and avoid excluding any other directories that have /tmp in the path name, edit the exclude file to read:
^./tmp/

All files are backed up relative to . (current working directory). To exclude any file or directory for which it is important to have the search match the string at the beginning of the line, use the ^ (caret character) as the first character in the search string, followed by . (dot character), followed by the filename or directory to be excluded. If the filename or directory being excluded is a substring of another filename or directory, use the ^. (caret character followed by dot character) to indicate that the search should begin at the beginning of the line and use the $ (dollar sign character) to indicate that the search should end at the end of the line.
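
Putting the exclude files together with the single-tape script above, a sketch might look like this (the scratch paths are illustrative):

echo '^./scratch/' >> /etc/exclude.rootvg        # rootvg exclusions
echo '^./work/tmpdata/' >> /etc/exclude.vg01     # vg01 exclusions go in /etc/exclude.vg01
mksysb -e -p -v /dev/rmt0.1                      # -e honors /etc/exclude.rootvg
savevg -e -p -v -f /dev/rmt0.1 vg01              # -e honors /etc/exclude.vg01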

A Viable Option

Obviously, the ability to back up an entire system onto a single tape only works in smaller shops with smaller amounts of data to back up, but there are still quite a few around today. While we like to think everyone is working in large enterprise data centers with around-the-clock operations staff, and dedicated storage, network and server teams, plenty of smaller customers have small staffs and small data sets where this idea might come in handy.

Storage Migration Tips

Edit: Still good stuff. Some links no longer work.

Move data without downtime using AIX

Originally posted April 2011 by IBM Systems Magazine

Organizations change storage vendors all the time, for many different reasons. Maybe a new storage product has come out with new features and functionality that will benefit the organization. Maybe the functionality isn’t new, but is unknown to your organization and someone decided it’s needed. Maybe a new storage vendor will include desired functionality in the base price. Maybe it’s a “political” decision. Maybe the equipment is just at the end of its life.

Whatever the reason, when it’s time to move from one storage subsystem to another, what are some options that you have to migrate your data using AIX? With ever-growing amounts of storage presented to our servers, and databases with sizes from several hundred GB to a few TB becoming more common, hopefully you’re not even considering something like a backup and restore from tape–along with all of the downtime that goes with it. Instead, you should focus on how to migrate data without downtime.

Evaluate the Environment

The first question I would ask is: how is your environment currently set up? Are you currently using virtual I/O (VIO) servers to present your logical unit numbers (LUNs) to the client LPARs in your environment using virtual SCSI or N_Port ID Virtualization (NPIV)? Are you presenting your LUNs to your LPARs using dedicated storage adapters? Take the time to go through different scenarios and look at the pros and cons of each. Call IBM support and get their opinion. Talk to your storage vendor. The more information you have, the better your decision will be. If possible, do test runs with test machines to ensure your procedures and planning will work as expected.

Possible Migration Solutions

If you’re using dedicated adapters in your LPARs to access your storage area network (SAN), the migration could be as simple as the following steps (sketched as commands after this list):

  • Loading the necessary storage drivers
  • Zoning the new LUNs from the new storage vendor to the existing host bus adapters (HBAs)
  • Running cfgmgr so that AIX sees the new disks
  • Adding your new disks to your existing volume groups with the extendvg command
  • Running the mirrorvg command for your rootvg disks, and the migratepv command to move the data in your other volume groups from the old LUNs to the new LUNs
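
Here is a rough sketch of those steps for a non-rootvg volume group, assuming hdisk2 is an old LUN and hdisk4 is its replacement (disk and volume group names are illustrative; for rootvg you would typically mirrorvg to the new disk, run bosboot and bootlist, then unmirror from the old one):

cfgmgr                      # discover the newly zoned LUNs
extendvg datavg hdisk4      # bring the new disk into the existing volume group
migratepv hdisk2 hdisk4     # move the data off the old LUN
reducevg datavg hdisk2      # drop the emptied old disk from the volume group
rmdev -dl hdisk2            # remove the old disk definition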

The trick here is making sure that any multipath drivers you need will coexist on the same LPAR. In some cases, you may not be able to find out whether your desired combination is even supported; it may be that no one has tried to mix your particular storage vendors’ code before. This is a good time to test things in your test environment.

A cleaner solution may be to use a new VIO server for your new disks. If you have the available hardware on your machine–which would consist of enough memory, CPU and an extra HBA to bring up the new VIO server–then it could be the ideal scenario. A new VIO server, with the new storage drivers, and the new LUNs being presented to your existing client LPARs using vSCSI may be your best bet. The advantage of this method is the storage drivers are being handled at the VIO server level instead of the client level, like they would be with NPIV. The disadvantage would be handling all of the disk mappings in the VIO server. I prefer to run NPIV and map disks directly to the clients’ virtual Fibre adapters, but again you could have the issue of mixing storage drivers so you would really need to test things before trying it on production LPARs.

If a new VIO server isn’t feasible for whatever reason, and you’re currently running with dual VIO servers and vSCSI, you should be able to remove the paths on your client LPARs that are coming from your secondary VIO server, then unmap the disks that are coming from your second VIO server. You can then remove the existing disks from your second VIO server, remove any multipath code and then repurpose it to see the new disks with the new code.

Clean Up

After the data has been migrated, you can go back and clean up the old disks and then zone the new disks to the secondary VIO server as well. Remember to set the reserve policy to no_reserve on the new disks in the VIO servers, and the hcheck_interval attribute on the new disks in your client LPARs.
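
A sketch of those attribute changes (device names are illustrative; confirm the values your multipath software expects):

# on each VIO server, as padmin, before mapping the new LUN:
chdev -dev hdisk10 -attr reserve_policy=no_reserve

# on the client LPAR, so MPIO health-checks its vSCSI paths:
chdev -l hdisk1 -a hcheck_interval=60 -P    # -P applies the change at the next reboot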

Chris Gibson has a great article that covers migration scenarios in more detail, which you can read on the developerWorks website.

While your data is migrating, you might want to watch what is happening with your disks. In some cases, such as with the mirrorvg command, you might not be able to get disk information and run logical volume manager (LVM) commands, as your volume group is locked. While you can still run topas to watch your disk activity and see that data is being read from your source disk and written to your target disk, you might want to get more detailed information. In this case, look at the -L flag in the AIX logical volume manager, which Anthony English covers, also on developerWorks.

“On LVM list commands, the -L flag lets you view the information without waiting to obtain a lock on the volume group. So, if you come across the message which tells you the volume group is locked, and you really can’t wait, you could use:

lsvg -L -l datavg

“The first -L doesn’t wait for a lock on the volume group. The second one is to list logical volumes. To list a single logical volume, such as lv00, use:

lslv -L lv00

“And to list physical volumes (PVs), which are almost always virtual:

lspv -L hdisk3”

Backing Up Cloud

Edit: Other people’s computers.

Considerations on security, data ownership

Originally posted February 2011 by IBM Systems Magazine

I miss the good old days when I had maintenance windows that were long enough that I could bring my machine down to single user mode and back up the whole system. These backups contained all of the data that mattered to the company at the time. Twenty years ago, I could only back up my machine with reel-to-reel tape drives. I’d bring my machine down to single-user mode to perform the backup, and each tape backup would take 12 minutes. I remember this because we would set the time on a portable kitchen timer when we started each tape. When the timer went off, we’d head to the computer room to swap out the tape, and go to the console to “press G to continue the backup.” All of the important data lived on that one machine. We didn’t worry about distributed computing environments, as we weren’t running any at the time. Sure we had a few PCs scattered here and there, but they weren’t critical. The entire company and all of its data lived on that central machine, and users who sat in front of green-screen dumb terminals accessed it. There wasn’t any data that users stored locally; it was all stored on the machine in the computer room.

When I hear about cloud computing, this is still the kind of environment I picture: where people are logged into a central machine that exists in a computer room in the sky. I use several Web-based applications like salesforce.com, webex.com or Google Mail, where I know nothing about the servers nor where the applications run, and I don’t necessarily care about the hardware or operating systems the applications use. I log in, use the service and log out. I often find myself logging into the IBM virtual loaner program website, where I can utilize slices of IBM hardware for short periods of time for demonstrations or proof of concepts or education.

I’ve worked with companies that have cloud offerings, where I can very easily log in, spin up some resources on their servers and then spin them back down when I am finished with them. As long as my response time is acceptable, do I really care about the physical hardware these virtual instances run on?

I’ve also had customers who were unable to get resources to test hardware in their environment. Using the cloud, they were able to log on to a cloud provider, spin up some server resources, do the testing that they needed and spin the resources back down–all without waiting for their internal IT departments to acquire and configure hardware for them. This would also benefit users who have test hardware several generations behind what they’re using in production. Instead of using old hardware, they can use more modern machines in virtual environments as needed.

Consider This

There are benefits to cloud computing, but there may be a few things to contemplate when considering a leap from your own computing assets to those that you don’t control. I realize that these days we’re usually accessing cloud-based applications over the Internet instead of from a green screen directly attached to computers in the machine room, but concerns like privacy, security and availability need to be considered along with all the benefits that are touted with the cloud.

Backup and recovery is another consideration when deploying services to the cloud. How do we back up our data that lives in the cloud? Surely cloud providers offer snapshots and local backups, and maybe that’s good enough for what you’re doing. If you wanted to copy your data to machines that are under your control, would you use the network and some kind of continuous data protection in order to move data from the cloud to machines you own so that you have another copy of it? Or would that method of data protection defeat the purpose of having someone else handling infrastructure management?

What happens if somewhere down the road you decide you want to get out of the cloud? Are there going to be issues with getting your data or OS images back under your control? Can you easily clone the systems back onto your own hardware or will you be looking at server reloads?

I have watched customers struggle with liberating their data from outsourcing companies and contracts. The companies that manage the machines have custom tools and scripts that they don’t want to hand over. They may have information around how the machines were configured that they don’t want to share. What’s your plan to get out of the cloud or move to another cloud provider if you find the one you are using isn’t for you? What do you do if the service you’re using goes down, or the company goes out of business or they change the interface so much that you no longer like the way you use the tool? Will upgrades and outages happen on your timetable or on theirs? When you get used to accessing servers and applications from anywhere there’s a network connection and then you find the provider has an outage, you want to be sure providers offer information and status updates on when they expect to recover the systems.

I enjoyed reading a blog post from John Scalzi who was trying an experiment where he would exclusively use Google Docs and a Google laptop computer to write a novel. Technical glitches began causing delays, and he eventually returned to working from his desktop, saying, “Until ‘the cloud’—and the services that run on them—can get out of your way and just do things like resident programs and applications can, it and they are going to continue to be second-place solutions for seriously getting work done.”

Return to Centralization

While there are definite advantages to the cloud-computing approach in some situations, I can’t help but think that the whole idea has a “Back to The Future” feel to it, where we take distributed computing resources and try to centralize them again, or worse yet, rebrand existing offerings as cloud offerings so we can say we’re on the cloud bandwagon. Certainly there are going to be applications and situations that will benefit from moving applications out of data centers. We just need to be sure to do our homework and educate ourselves before making the leap.

HMC Users: Important Fix Available

Edit: Some links no longer work.

Originally posted August 30, 2011 by IBM Systems Magazine

This information has been circulating for a while, and Anthony English covers the topic here and here. But I want to make sure HMC users are aware of this important update and the need to make sure you have the fix loaded if you’re at V7R7.3.0.

A problem is known to exist when using dual HMCs in one of two environments: either one HMC is at a different level than the other, or both HMCs are at the base HMC V7R7.3.0 level without fixes.

The problem is possible exposure to corruption that could cause you to lose partition profiles.

A fix is available and should be installed immediately on any HMC that might possibly be impacted by this problem.

If you’re using an HMC and an SDMC, be sure to get the fix for the SDMC as well.

From the IBM technical bulletin:

“This PTF was released July 18, 2011, to correct an issue that may result in partition configuration and partition activation profiles becoming unusable. This is more likely to occur on HMCs that are managing multiple systems. A symptom of this problem is the system may display Recovery and some or all profiles for partitions will disappear. If you are already running HMC V7R7.3.x, IBM strongly recommends installing PTF MH01263 to avoid this issue. If you are planning to upgrade your HMC to the V7R7.3.x code level, IBM strongly recommends that you install PTF MH01263 during the same maintenance window to avoid this issue.”

The efix can be found here. This package includes these fixes:

  • Fixed a problem where managed systems lose profiles and profiles get corrupted, resulting in a Recovery state that prevents DLPAR/LPM operations.
  • Fixed a security vulnerability with the HMC help content.

As noted, this is the statement IBM released in July, before the fix became available. The fix–PTF MH01263–is now out, so be sure to install it.

Again, from IBM:

“Abstract: HMC / SDMC Save Corruption Exposure
Systems Affected: All 7042s
Communicable to Clients: Yes

“Description:
IBM has learned that HMCs running V7R7.3.0 or SDMC running V6R7.3.0 could potentially be exposed to save area corruption (where partition profile data is stored).

“Symptoms include loss of profiles and/or recovery state due to a checksum failure against the profiles in the save area. In addition, shared processor pools names can be affected (processor pool number and configuration are not lost), system profiles lost, virtual ethernet MAC address base may change causing next partition activation to fail or to have different virtual Ethernet MAC addresses, loss of a default profile for all or some of the partitions.

“Partitions will continue to run, but reactivation via profile will fail if the profile is missing or corrupted. All mobility operations and some DLPAR operations will fail if a partition has missing or corrupted profiles.

“Environments using HMCs or SDMCs to control multiple managed systems have the greatest exposure. Triggers for exposure include any of the following operations performed in parallel to any managed system: Live Partition Mobility (LPM), Dynamic LPAR (DLPAR), profile changes, partition activation, rebuild of the managed system, rebooting with multiple servers attached, disconnecting or reconnecting a server, hibernate or resume, or establishing a new RMC connection.

“Recommended Service Actions:
Prevention/Workaround:
There is no real work-around other than limiting the configurations to a single HMC managing a single managed system.

“Customers who have not yet upgraded or installed HMC 7.7.3 should delay the upgrade/install if at all possible until a fix is available.

“Customers who have not yet installed and deployed SDMC 6.7.3.0 should avoid discovering production servers until a fix is available.

“Customers that have 7.7.3 or SDMC 6.7.3.0 deployed should:

  • Immediately do a profile backup operation for all managed servers:

    bkprofdata -m <managed system name> -f <filename>

  • Minimize the risk of encountering the problem by using only a single HMC or SDMC to manage a single server via the following options:
  1. Power off dual HMC/SDMC or remove the connection from any dual HMC/SDMC.
  2. Use one HMC per server (remove/add connections as needed if necessary).
  3. A single HMC/SDMC managing multiple servers might be done relatively safely if the operations listed under triggers above are NOT done to two different servers concurrently.

“Recovery:
NOTE: Recovery will be easiest with a valid backup of the profile data. So it is extremely important to back up profile data prior to an HMC upgrade or after any configuration changes to the save area. If a profile data backup exists, this problem can be rectified by restoring using:

    rstprofdata -m <managedsysname> -l 3 -f <backupfilename>

“In addition to user backups, profile backups can be extracted from the previous save upgrade data (DVD or disk); a backup console data (if available); or pedbg.

“If a good backup does not exist, call your HMC/SDMC support to determine if recovery is possible.

“Fix:
A fix to prevent this from occurring is due out by the end of July (Editor’s note: We realize this is now available but wanted to include the verbiage for completeness), but the PTF will not fix an already corrupted save area. A follow-up notification will be sent as soon as it is available.”

Please heed the warnings and load this fix as soon as possible if you’re running V7R7.3.0. And don’t run any HMCs at V7R7.3.0 while running others at a lower level.

AIX 4 admins website

Nice site by Balazs Babinecz

http://aix4admins.blogspot.com/

This blog is intended for anyone who is working with AIX and has encountered problems and is looking for fast solutions, or who just wants to study AIX. This is not a usual blog; it is not updated every day. I tried to organize AIX-related subjects into several topics, and when I find new info/solutions/interesting stuff I will add it to its topic.

IBM Systems Magazine videos on Youtube

https://www.youtube.com/user/ibmsystemsmag/videos

Videos by Rob McNelly

Solving dependency issues with rpm

https://www.youtube.com/watch?v=imlE8ogyCQM

Running screen

https://www.youtube.com/watch?v=HoZj4LMO1mM

Running HMC Scanner

https://www.youtube.com/watch?v=5YxOgS8uhOo

Running System Planning Tool to Document Servers

https://www.youtube.com/watch?v=w-R4zEZYad0

vncserver

https://www.youtube.com/watch?v=_8pfPrSYsG4

To VIOS or Not to VIOS

Edit: I assume the VIO server is right for you. Some links no longer work.

Consider whether the Virtual IO Server is right for you

Originally posted September 2010 by IBM Systems Magazine

I attended an OMNI user group meeting a while ago and during the meeting, someone mentioned the difference between attending an education event where you’re familiar with the topic versus a topic you know little about. While you’ll probably learn something at the familiar event, it may only be 1-2 percent added knowledge and a lot of repeated information. But at an event that’s unfamiliar, 50-60 percent of the material may be new to you and it might feel like you’re drinking from a fire hose as you try to digest all of these new ideas and concepts.

At one event, you’re comfortable. At the other, you can feel overwhelmed or wonder why you don’t already know these concepts. Rather than beat yourself up about the knowledge that you haven’t been exposed to yet, see it as an opportunity to learn something new.

Nowhere was that concept clearer at that meeting than during a discussion about whether or not to use the Virtual IO Server (VIOS).

To VIOS or Not to VIOS

Several people had a lively discussion around the pros and cons of virtual IO, but it was clear to me that many were unfamiliar with or misunderstood the capabilities of VIOS. They kept trying to compare VIOS to the managed partition they remembered from years past, which seemed to be all bad memories. They worried about VIOS being a single point of failure or adding a layer of complexity to their server. I’m not certain the IBM i world is fully on board with this solution yet.

At the meeting, it took a while to dispel the myths. Those of us with VIOS experience explained that you can have dual VIO servers so that VIOS is no more of a single point of failure than internal disks would be. With PowerVM virtualization and VIOS, you can continue to add more LPARs to your frame as long as you have available CPU and memory. You don’t have to spend more money for adapters or disks, which leads to lower overall costs compared with dedicated adapters and disks. Using VIOS, you could very easily set up test systems on the same frame as your production systems using this scenario. Rapid provisioning becomes a reality when your environment is virtualized, as you’re not making any changes to physical hardware.

Using VIOS, you could share your storage environment and ‘play nice’ with the rest of the servers in the organization. Instead of people saying that you have an oddball/proprietary/expensive/closed machine sitting in the corner, you can tell them that besides running IBM i, you can also run AIX or Linux—all on the same frame, all sharing the same back-end storage-area network (SAN) and the same network and disk adapters.

Once the meeting attendees understood what you could do with VIOS, and they realized you can pretty much set it up and forget it (until you need to deploy new partitions, and even then it’s a straightforward process), it seemed to me that some warmed up to the idea of virtualizing using VIOS.

More recently, a midrange.com thread entitled “To use VIOS or Not to use VIOS, that is the question” discussed the same types of concerns about complexity and which systems should be primary or guest partitions.

I’ve written twice before on IBM i and VIOS—in a “My Love Affair with IBM i and AIX” blog entry and an article called “Running IBM i and AIX in the Same Physical Frame”—and I think the whole issue boils down to time, availability and training. It takes time to get comfortable with something new. None of us started working on IBM i and were experts in it within a week. It took time to become proficient. The same can be said for VIOS. If you come from a UNIX background, it can help, but the padmin user interface is foreign even to AIX administrators the first time they log into it. Things are just different enough that AIX admins have to learn the padmin/VIOS interface the same way that IBM i admins do. One great resource to start with is the VIO Cheat Sheet.

Real-World Experience

How do you learn VIOS if you don’t have VIOS to play with? Without a test box to work on, it can be difficult to learn and understand. You can read IBM Redbooks publications such as “Virtual I/O Server Deployment Examples” and attend lectures on the topic (I recommend the Central Region Virtual Users group for replayed lectures on many topics, including VIOS configuration overviews), but without hands-on experience, it can be difficult to become proficient. I’d argue that this is the same as hiring a new IBM i admin, but then asking him to read manuals and Redbooks publications without ever letting him log into the machine. He’ll probably not be very effective. With time, access to a server running VIOS and training, anyone can become comfortable with it.

Recently, a customer had a new POWER7 770 server that they were adding 25 AIX and two VIOS partitions to. No problem. I loaded VIOS on the internal disks, and the AIX partitions all booted from SAN. They wanted to get their feet wet with IBM i on the 770, and they wanted to see how it would perform using SAN disks instead of internal disks. No problem. I assigned the proper CPU and memory like I would for any new partition, but I didn’t assign any real IO devices. I assigned it virtual SCSI adapters and a virtual network adapter. It was getting its disk from a SAN. It was going to boot from SAN. I didn’t even use physical media to install it; I just used a virtual optical device in the VIO server and booted the LPAR from there. I used the open source tn5250 program to connect to the console, and we were able to load IBM i on the machine to test it out. They were very pleased with the performance that they saw with the SAN and the POWER7 server.

Make an Informed Decision

Of course, one size doesn’t fit all and there are plenty of great reasons to exclude VIOS from your environment. Maybe you don’t have the need for multiple workloads or virtualization on Power hardware. Maybe you don’t have a SAN in your environment and don’t see one coming any time soon. But don’t let fear of the unknown or memories of the way things once were steer your current thinking around virtualization. Make yourself aware of the pros and cons, and make an informed decision.

Seamless Transitions

Edit: Have you upgraded yet?

The upgrade to AIX 7 is hassle free and benefit rich

Originally posted September 2010 by IBM Systems Magazine

Last month, IBM announced AIX* 7 and its general availability date of Sept. 10. Companies with current IBM software-maintenance agreements receive this upgrade at no charge, meaning adoption should be swift. Technologists in your company will likely be eager to schedule operating-system upgrades and start using the new features. Because of the open beta, many are already testing it. See “Open Beta” for more details.

[The AIX* 7 open beta program, where you could freely download and test the operating system, has been ongoing this summer, and downloads are scheduled to continue through October. Many of your AIX administrators and IT staff have already downloaded the AIX 7 images and have begun testing the new operating system.

You can install the open beta onto your POWER4* or better hardware, but you can’t take that open beta installation and then upgrade or migrate it. You’ll need to do a fresh reinstallation of AIX 7 after it’s generally available. The open beta is meant for test systems and becoming familiar with the operating system and its new features, not for production workloads. Assume that everything you do on this test machine will need to be redone after installing from the official release media.

—R.M.]

Why 7?

Take some care when calling AIX 7 a new version of the operating system; it’s really more of an evolution or continuation of AIX 6. The upgrade from AIX 5.3 to AIX 6 was considerably more extensive than the change from AIX 6 to AIX 7, which might be considered a fine-tuning. For instance, a fresh install of AIX 6 needed far less tuning and tweaking than a fresh installation of AIX 5.3 because the default parameters made more sense, and AIX 7 continues that trend with defaults that suit the majority of customers.

Some people wanted to call this new release AIX 6.2, but IBM went with AIX 7 in part because of the POWER7* hardware releases. Don’t let the name make you worry about switching in your environment. According to IBM Marketing Manager Jay Kruemcke, “If you’ve been waiting to upgrade, now’s a good time to do so.” Kruemcke points to the binary-compatibility guarantee—where IBM states: “Your applications, whether written in house or supplied by an application provider, will run on AIX 7 if they currently run on AIX 6 or AIX 5L—without recompilations or modification”—and IBM’s great history of binary compatibility throughout the years.

Most IT staff will make time in their busy schedules to test new versions of operating systems as soon as they can. With open beta, they may have already reported the results of their testing and be making the case for moving to AIX 7 now. The case is strong.

If you’ve been waiting to upgrade, now’s a good time to do so. —Jay Kruemcke, IBM marketing manager

The Power of POWER

As you consider AIX 7, it’s important to know what version of POWER* systems (or older RS/6000* systems) and the operating system your company is running. Many of you may be surprised to discover that you’re running AIX 5.2 on older hardware. This version of the AIX operating system was withdrawn from marketing in July 2008, but, for whatever reason, some companies still need it running in their environments. This old machine is typically hosting an application that can’t be upgraded—or may not be worth the effort to upgrade—and it’s typically running on older, slower, less energy-efficient, nonvirtualized hardware. “When that’s the case, you’re missing out on great performance enhancements, new features and cost savings,” Kruemcke says.

Although AIX 7 can run on POWER4* or later hardware, consider running it on POWER7 hardware. A huge benefit of AIX 7 running on POWER7 hardware is the capability to collect those older AIX 5.2 operating-system images, take a system backup (mksysb), and install that AIX 5.2 backup image without modification into an AIX 7 workload partition (WPAR). Once your mksysb image has been created and moved to your POWER7 system, you can give a flag to the WPAR creation command (mkwpar) and restore that backup image into a WPAR running inside AIX 7. Since these AIX 5.2 WPARs will run on top of AIX 7, you’ll also benefit from POWER7’s simultaneous multithreading with four threads and greater performance. This is an excellent way to consolidate old workloads running on less-efficient hardware.
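
A sketch of that restore, assuming the AIX 5.2 mksysb has already been copied to the AIX 7 host and the versioned WPAR runtime filesets are installed (the WPAR name and path are illustrative):

mkwpar -n legacy52 -C -B /backups/aix52_rootvg.mksysb    # -C creates a versioned WPAR from the 5.2 backup image
startwpar legacy52                                       # start the workload partition
clogin legacy52                                          # log in and verify the old environment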

You should immediately see improved performance after moving your workload to a POWER7 server from older hardware, and you’ll enjoy all of the benefits of virtualization on new hardware. “Customers who’ve never looked at WPARs before will take a second look,” Kruemcke says.

Moving your AIX 5.2 system to POWER7 hardware makes it part of an LPAR. It can be part of a micropartitioned pool of processors and donate any idle cycles back into the shared processor pool, and it can have its disk and networking virtualized and handled through VIO servers. The whole LPAR can move to another POWER7 machine in your environment using Live Partition Mobility, or just the WPAR itself can move to another POWER7 machine via Live Application Mobility. Your older operating system can now benefit from all of the advantages of the latest technology, without upgrading the operating system and application.

If you choose to run AIX 5.2 in a WPAR, you’ll have access to IBM phone support, and the operating system will have patches available for critical issues. Instead of needing extended IBM support contracts for your AIX 5.2 machines, you can get ongoing support through your regular maintenance contracts.

Why WPARs?

Nigel Griffiths, Power Systems* technical support, IBM Europe, says companies will see a quadruple win with this move: They’ll remove end-of-life slower machines from their environments, do away with the higher electricity costs of those older machines, eliminate the higher hardware-maintenance costs for those older machines, and decrease the data-center footprint of machine and network cabling.

“WPARs have some great advantages over LPARs,” Griffiths says. “WPARs can be created faster than LPARs, LPARs need more memory to boot compared to WPARs, and you can share application code between multiple WPARs compared to having the same application sitting across LPARs, to name a few.”

Although the WPAR adoption rate has been slow so far, Kruemcke says new WPAR capabilities will cause more people to consider them. Besides running AIX 5.2 in a WPAR, you’ll also have support for NPIV and VIOS storage with WPARs in AIX 7, as the operating system includes support for exporting a virtual or physical Fibre channel adapter to a WPAR. In the new release, the adapter will be exported to the WPAR in the same manner as storage devices.

If you’re running AIX 5.3 on POWER7 hardware, keep in mind that you’re running in POWER6* compatibility mode and aren’t fully exploiting the new hardware. “Since you can upgrade directly from AIX 5.3 to AIX 7, it makes sense to do that upgrade and enjoy the performance benefits of running AIX 7 in POWER7 mode on POWER7 hardware,” Kruemcke says.

What’s New?

Although not a major change, AIX 7 boasts some nice new features.

1,024 threads. AIX 7 supports a large LPAR running 1,024 threads, compared with 256 threads in AIX 6. This large LPAR contains 256 cores, and each core can run four threads, providing the capability to run 1,024 threads in a single operating-system image. If your business needs a very large machine running a massively scaled workload, this thread boost will be a huge benefit. Even if you don’t think you need the capability, it’s nice to know you can migrate your workload into this large environment if needed.

AIX Profile Manager. Besides the massive scalability and the capability to run AIX 5.2 in a WPAR, AIX 7 also supports the AIX Profile Manager, formerly known as the AIX Runtime Expert. An IBM Systems Director plug-in, AIX Profile Manager provides configuration management across a group of systems. This lets you see your current system values, apply new values across multiple systems and compare values between systems. Configuring and maintaining your machines can be easier, and you can verify that machine settings haven’t changed over time. You can also set up one machine, then copy its properties across multiple systems. These profiles and properties might include environment variables, tuneables and security profiles.

Systems Director. AIX 7 has also made a change in Web-based System Manager (WebSM), which now integrates with IBM Systems Director and is called the IBM Systems Director Console for AIX. This provides a Web-based management console for AIX so systems administrators have centralized access to do tasks like viewing, monitoring and managing systems. This tool will let staff manage systems using distributed command execution and use familiar interfaces such as the System Management Interface Tool from a central management control point.

Language support. As more companies around the globe deploy AIX 7, they’ll be happy to know that it supports 61 languages and more than 250 locales based on the latest Unicode technology. Unicode 5.2 provides standardized character positions for 107,156 glyphs, and AIX 7 complies with the latest version. This will make the operating system and applications more accessible for non-English speakers.

Updated shell environment. AIX 7 now provides a newly updated version of the ksh93 environment. AIX 6 provided a ksh93 based upon the ksh93e version of the popular shell. AIX 7 now updates ksh93 to be based upon ksh93t. Users now have access to a variety of enhancements and improvements that the Korn shell community has made over the past several years, resulting in a more robust shell programming experience. Many customers complain about needing to learn to get around in the Korn shell, and AIX 7 should help them see improvements when they run the set -o viraw command. They’ll then have access to tab completion and moving through their shell history file using the arrow keys instead of vi commands. Users of other shells from other operating systems will have one less thing to learn on AIX.

Role-based access control. Many companies still rely on sudo to give nonroot users root user functionality. AIX 7 continues supporting role-based access control (RBAC) but enhances it by providing resource isolation. In previous iterations of RBAC, if you gave someone access to change a device, they could change any device of that type. Now you can limit their access to a specific device on the system. This lets you give a nonroot user access to resources that they can manage, and have more granular control over what they can do.

Clustering. Another highlight of this announcement is the clustering technology that’s being built into the operating system. AIX 7 now has built-in kernel-based heartbeats and messages, and multichannel communication between nodes. It also features clusterwide notification of errors and common naming of devices across nodes. This will let multiple machines see the same disk and have it be called the same name. Built-in security and storage commands support operations across the cluster.
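
As a rough sketch of the new cluster commands (node and disk names are illustrative, and the repository disk must be visible to every node):

mkcluster -n webcluster -m nodeA,nodeB -r hdisk9    # create the cluster with hdisk9 as the repository disk
lscluster -m                                        # list the nodes and their state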

You used to have to purchase HACMP or PowerHA* products and install them on top of AIX to get these features, but now much of that functionality is built into the operating system and better integrated with AIX 7. This should make implementing high-availability clusters easier for administrators.

Continued Investment

IBM continues to make investments in the Power Systems hardware, AIX and software. You can be sure that IBM will continue to stand behind the investments you’re making well into the future. So take AIX 7 for a spin and enjoy the new features.

Those Who Do Without Virtualization

Edit: Still a good topic.

Originally posted November 30, 2010 by IBM Systems Magazine

Working on virtualized systems as much as I do, and talking to people about virtualization as often as I do, I tend to forget a couple things:

  1. Not all IBM Power Systems users have virtualized systems.
  2. Not all of them use VIOS even while they benefit from other aspects of virtualizing their machines.

It isn’t necessarily that these shops are limited by the constraints of older hardware and operating systems. I know of customers with POWER6 and POWER7 hardware that haven’t yet virtualized their systems. Maybe they lack the time or the resources to virtualize more fully, or maybe they simply lack the skills that come only with hands-on experience.

Customers who aren’t hands-on generally don’t realize that virtualization covers a wide range of functionality. Using workload partitions (WPAR) counts as virtualization. Micropartitioning CPU, where we assign fractions of a CPU to an LPAR and then set up processing entitlements and cap or uncap partitions based on our LPAR’s requirements? That’s virtualization. We use VIOS to virtualize disk, the network or both. NPIV allows us to virtualize our fibre adapters and have our clients recognize the LUNs we provision–and it saves us the effort of having to map them to the VIOS and remap them to the VIOS client LPARs. We use the built-in LHEA to virtualize the network. We could create an LPAR with some dedicated physical adapters and some virtual adapters. We could use active memory sharing and active memory expansion to better utilize our systems’ memory. Power Systems offers many choices and scenarios where it can be said that we’re using virtualized machines.

I know some administrators who’ve been unable to convince their management or application vendors of virtualization’s benefits. I know of some IBM i users who are reluctant to get on board with VIOS (though plenty of AIX shops still don’t virtualize, either). Sometimes it’s the vendor that lacks the time, resources or skills for virtualization. For instance, I’ve seen multiple customer sites where tons of I/O drawers are used; the vendor won’t officially support VIOS because the vendor hasn’t tested it, and these customers don’t want to run an unsupported configuration.

I talked to an admin who has experience with configuring logical partitions, setting up dedicated CPUs and dedicated I/O slots in his environment, but he continues to use a dynamic logical partition (DLPAR) operation to move a physical DVD between his different LPARs. It’s the way he’s always done it. He figures that since his shop doesn’t use virtualization, it’s no big deal that he has no experience with VIOS and virtual optical media. “You can’t miss what you’ve never had,” is how he put it.

Others will tell me that they see the writing on the wall. They insist they’ll virtualize, some day.

Are there roadblocks keeping you from virtualizing? Are there complications that prevent you from moving to a fully virtualized environment? I’d like to hear about the challenges you face. Please e-mail me or post in Comments.

The Evolution of Education

Edit: Link no longer works.

Originally posted June 29, 2010 by IBM Systems Magazine

As more companies migrate to IBM Power Systems hardware, the need for education grows. It may be hard for us long-time users to imagine, but every day, seasoned pros are just getting started on POWER hardware.

While I’ve provided customer training, what I do–either through giving lectures on current topics or talking to people informally as their systems get built–doesn’t compare to the educational value of a “traditional” instructor-led class or lab.

With that in mind, check into the IBM Power Systems Test Drive, a series of no-charge remote (read: online) instructor-led classes.

Courses being offered include:

IBM DB2 WebQuery for IBM i (AT91)
IBM PowerHA SystemMirror for IBM AIX (AT92)
IBM PowerHA and Availability Resiliency without Downtime for IBM i (AT93)
Virtualization on IBM Power (AT94)
IBM Systems Director 6.1 for Power Systems (AT95)
IBM i on IBM Power Systems (AT96)
IBM AIX on IBM Power Systems (AT97)

Remote training, of course, saves IT pros and their employers the time and expense of having to travel to an educational opportunity. But is something lost if students, instructor and equipment aren’t in the same room? Not necessarily. Let’s face it: Nowadays a lot of education is remote anyway–when you travel to classes and conferences and do lab exercises, you’re likely logging into machines that are located offsite. By now good bandwidth is the norm, so network capacity shouldn’t be an issue when it comes to training.

Sure, offsite training has its advantages. When you travel somewhere for a class, there are fewer distractions, so you can concentrate on the training. Taking training remotely from your office desk, it’s easy to be sidetracked by your day-to-day responsibilities. (This does cut both ways though–I often see people connect to their employer and work on their laptops during offsite training.)

Offsite training also allows you to meet and network with your peers. I still keep in touch with folks I’ve met at training sessions. If I run into a problem with a machine I’m working on, I have any number of people I can contact for help. Being able to tap into that knowledge with just a call or a text message is invaluable.

While I haven’t taken a remote instructor-led class like the ones IBM offers, I’ve heard positive feedback from those who have. But what about you? I encourage you to post your thoughts on training and education in comments.

The Importance of the Academic Initiative

Edit: Some links no longer work.

Originally posted May 18, 2009 by IBM Systems Magazine

In a previous blog entry titled, “Some New Virtual Disk Techniques,” I said that I usually learn something new whenever I attend or download the Central Region Virtual User Group meetings from developerWorks.

For instance, at the most recent meeting, Janel Barfield gave a typically excellent presentation on Power Systems Micro-Partitioning. But for this post I want to focus on the IBM Academic Initiative. IBMer Linda Grigoleit took a few minutes to cover material about the IBM Academic Initiative, which is available to high school and university faculty.

From IBM:

“Who can join? Faculty members and research professionals at accredited institutions of learning and qualifying members of standards organizations, all over the globe. Membership is granted on an individual basis. There is no limit on the number of members from an institution that can join.”

Check out the downloadable AIX and IBM i courses and imagine a high school or college student taking these classes. With this freely available education, these students would be well on their way to walking in the door of an organization and being productive team members from the beginning of their employment. Think about the head start you would have had if you’d been able to study these Power Systems AIX or IBM i course topics at that age.

Although, as I said in a previous AIXchange entry titled “You Have to Start Somewhere,” I like the idea of employees starting out in operations or on help desks in their organizations, the Academic Initiative is a great way for people to get real-world skills on real operating systems.

Instructors also benefit from the program, as IBM offers them discounts on certification tests, training and either discounted hardware or free remote access to the Power System Connection Center.

There’s more. From IBM:

“The Academic Initiative Power Systems team provides vouchers for many IBM instructor-led courses to Academic Initiative members at no cost.

“The IBM Academic Initiative hosts an annual Summer school event for instructors. Each summer this very popular event features topics for those new to IBM i platform.”

Maybe it’s time you get involved. Go to your local high school or university. Find the instructors who would be interested in learning and teaching this technology. Get them to sign up with the Academic Initiative and get involved. With your skills and experience, you could help them get started, and your ongoing assistance would be appreciated by instructors and students alike.

AIX and i Worlds Can Learn from Each Other

Edit: Link no longer works.

Originally posted February 24, 2009 by IBM Systems Magazine

I recently read this iDevelop blog post and it got me thinking. I too have been involved in these discussions with a local IBM i user group that recently had a conference planned. The group was forced to cancel the event due to lack of attendance. Was that due to fewer and fewer actual users of the platform? Was that due to budget constraints or time constraints, where people just couldn’t take the time to spend a day away from the office? Or had people lost their jobs because their companies went out of business? The conference planners are not sure. All they know is that they wanted to attract enough bodies to their event to cover costs, so they thought that a combined i and AIX conference would be a good thing.

By combining their conference, they had hoped to introduce i people to AIX and Linux. They planned to offer some introductory level tracks so that IBM i people could learn more about AIX. At the same time, introductory tracks were planned to give AIX administrators a better understanding of the benefits of IBM i. But besides the intro classes, power-user sessions were planned, aimed at the serious administrators from both camps.

I was at virtual I/O server (VIOS) training last year that was aimed toward users of IBM i, and it seemed to me that this group didn’t want to hear the message that was being delivered. Instead of trying to understand how IBM i using VIOS attached to external storage would be a good thing to consider, they seemed to focus on how this would be a different way of doing things and they seemed resistant to learning about it.

I also attended an IBM event that featured technical lectures for both i and AIX users, and I watched IBM i users walk out, because they said the event was too slanted toward AIX.

I can certainly agree with the points that the authors of the iDevelop blog make, where you might think that people are watering down content or leaving out sessions in order to accommodate both groups. However, combining events like this might also be an advantage to the attendees. Many shops run IBM i, but they are also running HP servers, Sun servers–some flavor of UNIX. This means that besides the investment in IBM i, these shops are also investing in other vendors’ solutions.

Instead of using all of this different hardware from all of these different vendors, why not consolidate and virtualize the Power Systems server running IBM i in a partition and some number of AIX LPARs in other partitions? While this seems pretty straightforward to someone with an AIX background because we think nothing of running different operating systems and different versions of the same operating system on the same frame, some IBM i people might not realize that this is possible, or what the benefits might be.

There might be discussions in some organizations about eliminating that IBM i machine that just sits in the corner and runs, and taking that workload and running it on Windows or Linux or some flavor of UNIX. If all you understand is IBM i, it might be difficult to articulate its pros and cons versus the other operating systems. There can be a perception that IBM i is still a green-screen 1988 legacy system, instead of a powerful integrated operating system that frankly could use better marketing and education so that more organizations were made aware of its benefits.

If IBM i administrators aren’t keeping up on the trends in the UNIX space, they might be missing a great opportunity to help extend the longevity of their IBM i investments, both in hardware and knowledge. By running more of their company’s workloads on the same hardware from the same vendor, they are now benefiting from having “one throat to choke” if things go wrong, but better than that, they are running the best server hardware currently available.

The problem is, without understanding the basics of AIX and VIOS, and why it can all coexist happily on the same hardware, IBM i administrators might have a difficult time making the case to their management team that this server consolidation could be the way to go.

The IBM technical university that was held in Chicago last fall was a great example of how this can be done–hold tracks that appeal to IBM i administrators and those that appeal to traditional AIX administrators. Let attendees freely move between tracks so that they can learn more about the “other side.” Although they won’t become experts after a few sessions, they should at least start to understand the lingo, the jargon and the benefits that come from the other operating system. AIX administrators might be surprised to learn just how good IBM i is, while i administrators might also be pleasantly surprised to learn just how good AIX is.

Change can be scary, change can be hard, but change will come. How will we deal with it? Will we try to keep our traditional user groups doing the same old thing or will we try to learn more about other technologies? By telling the IBM i story to AIX administrators, at a minimum there will be more people out there that understand the basics of why it is so good, and who might be eager to make the case to management that consolidation might make sense.

An LPAR Review

Edit: Some links no longer work.

Originally posted September 2009 by IBM Systems Magazine

To learn more about this topic, read these articles:
Software License Core Counting
Trusted Logging Simplifies Security
Tools You Can Use: Planning and Memory
Improve Power Systems Server Performance With Virtual Processor Folding
Now’s the Time to Consider Live Partition Mobility
Improve Power Systems Server Performance With Enhanced Tools
How to Use rPerfs for Workload Migration and Server Consolidation
Entitlements and VPs- Why You Should Care
Three Lesser-Known PowerVM Features Deliver Uncommon Benefits

In 2006 IBMer Charlie Cler wrote a great article that helps clear up confusion regarding logical, virtual and physical CPUs on Power Systems (“Configuring Processor Resources for System p5 Shared-Processor Pool Micro-Partitions”). This still seems to be a difficult concept for some people to grasp, particularly those who are new to the platform. But if you put in the research, there are a lot of quality resources available.

I recently saw Charlie give a presentation to a customer where he covered this topic again, and I based this article on the information that he gave us that day, with his permission.

When you’re setting up LPARs on a hardware management console (HMC), you can choose to have dedicated CPUs for your LPAR, which means an LPAR exclusively uses a CPU; it isn’t sharing CPU cycles with any other LPAR on the frame. On POWER6 processor-based servers you can elect to have shared, dedicated processors–where the system allows excess processor cycles from a dedicated CPU’s LPAR to be donated to the shared processor pool.

Instead of using dedicated or shared dedicated CPUs, you could choose to let your LPAR take advantage of being part of a shared pool of CPUs. An LPAR operates in three modes when it uses a shared pool: guaranteed, borrowing and donating. When your LPAR is using its entitled capacity, it isn’t donating or borrowing from the shared pool. If it’s borrowing from the pool, then it’s going over its entitled capacity and using spare cycles another LPAR isn’t using. If the LPAR is donating, then it isn’t using all of its entitlement, but returning its cycles to the pool for other LPARs to use.

In his presentation, Cler shared some excellent learning points that I find useful:

  • The shared processor pool automatically uses all activated, non-dedicated cores. This means any capacity upgrade-on-demand CPUs that were physically installed in the frame but not activated wouldn’t be part of the pool. However, if a processor were marked as bad and removed from the pool, the machine would automatically activate one of the deactivated CPUs and add it to the pool.
  • The shared processor-pool size can change dynamically as dedicated LPARs start and stop. As you start more and more LPARs on your machine, the number of available CPUs continues to decrease. Conversely, as you shut down LPARs, more CPUs become available.
  • Each virtual processor can represent 0.1 to 1 of a physical processor. For any given number of virtual processors (V), the range of processing units that the LPAR can utilize is 0.1 * V to V. So for one virtual processor, the range is 0.1 to 1, and for three virtual processors, it’s 0.3 to 3.
  • The number of virtual processors specified for an LPAR represents the maximum number of physical processors the LPAR can access. If your pool has 32 processors in it, but your LPAR only has four virtual CPUs and it’s uncapped, the most it’ll consume will be four CPUs.
  • You won’t share pooled processors until the number of virtual processors exceeds the size of the shared pool. If you have a pool with four CPUs and two LPARs, and each LPAR has two virtual CPUs, there is no benefit to sharing CPUs. As you start adding more LPARs and virtual CPUs to the shared pool, eventually you’ll have more virtual processors than physical processors. This is when borrowing and donating cycles based on LPAR activity comes into play.
  • One processing unit is equivalent to one core’s worth of compute cycles.
  • The specified processing unit is guaranteed to each LPAR no matter how busy the shared pool is.
  • The sum total of assigned processing units cannot exceed the size of the shared pool. This means you can never guarantee to deliver more than you have available; you can’t guarantee four CPUs worth of processing power if you only have three CPUs available.
  • Capped LPARs are limited to their processing-unit setting and can’t access extra cycles.
  • Uncapped LPARs have a weight factor, which is a share-based mechanism for the distribution of excess processor cycles. The higher the number, the better the chances the LPAR will get spare cycles; the lower the number, the less likely the LPAR will get spare cycles.

When you’re in the HMC and select the desired processing units, it establishes a guaranteed amount of processor cycles for each LPAR. When you set it to “Uncapped = Yes,” an LPAR can utilize excess cycles. If you set it to “Uncapped = No,” an LPAR is limited to the desired processing units. When you select your desired virtual processors, you establish an upper limit for an LPAR’s possible processor consumption.

Charlie gives an example of an LPAR with two virtual processors. This means the assigned processing units must be somewhere between 0.2 and 2. The maximum processing units the LPAR can utilize is two. If you want this LPAR to use more than two processing units worth of cycles, you need to add more virtual processors. If you add two more, then the assigned processing units must now be at least 0.4 and the maximum utilization is four processing units.

You need to consider peak processing requirements and the job stream (single or multi-threaded) when setting the desired number of virtual processors for your LPAR. If you have an LPAR with four virtual processors and a desired 1.6 processing units–and all four virtual processors have work to perform–each receives 0.4 processing units. The maximum processing units available to handle peak workload is four. Individual processes or threads may run slower, while workloads with a lot of processes or threads may run faster.

Compare that with the same LPAR that now has only two virtual processors instead of four, but still has a desired 1.6 processing units. If both virtual processors have work to be done, each will receive 0.8 processing units. The maximum processing units possible to handle peak workload is two. Again, individual processes or threads may run faster, while workloads with a lot of processes or threads may run slower.

If there are excess processing units, LPARs with a higher desired virtual-processor count are able to access more excess processing units. Think of a sample LPAR with four virtual processors, desired 1.6 processing units and 5.8 processing units available in the shared pool. In this case, each virtual processor will receive 1.0 processing units from the 5.8 available. The maximum number of processing units that can be consumed is four, because there are four virtual processors. If the LPAR only has two virtual processors, each virtual processor will receive 1.0 processing units from the 5.8 available, and the maximum processing units that can be consumed is two, because we only have two virtual processors.

The minimum and maximum settings in the HMC have nothing to do with resource allocation during normal operation. Minimums and maximums are limits applied only when making a dynamic change to processing units or virtual processors using the HMC. The minimum setting also allows an LPAR to start with less than the desired resource allocations.

Another topic of importance Cler covered in his presentation is simultaneous multi-threading (SMT). According to the IBM Redbooks publication “AIX 5L Performance Tools Handbook” (TIPS0434, http://www.redbooks.ibm.com/abstracts/tips0434.html?Open): “In simultaneous multi-threading (SMT), the processor fetches instructions from more than one thread. The basic concept of SMT is that no single process uses all processor execution units at the same time. The CPU design implements two-way SMT on each of the chip’s processor cores. Thus, each physical processor core is represented by two virtual processors.” Basically, one processor, either dedicated or virtual, will appear as two logical processors to the OS.

If SMT is on, AIX will dispatch two threads per processor. To the OS, it’s like doubling the number of processors. When “SMT = On,” logical processors are present, but when “SMT = Off,” there are no logical processors. SMT doesn’t improve system throughput on a lightly loaded system, and it doesn’t make a single thread run faster. However, SMT does improve system throughput on a heavily loaded system.
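
If you want to check or change the SMT mode on a running AIX LPAR, the smtctl command handles it. A minimal sketch (output and defaults will vary by AIX level):

smtctl                 # display the current SMT mode and the logical CPUs behind each virtual processor
smtctl -m off -w now   # turn SMT off immediately, without a reboot
smtctl -m on -w boot   # turn SMT back on at the next boot (run bosboot before rebooting so the change sticks)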

In a sample LPAR with a 16-CPU shared pool, SMT on, 1.2 processing units, three virtual processors and six logical processors: the LPAR is guaranteed 1.2 processing units at all times. If the LPAR isn’t busy, it will cede unused processing units to the shared pool. If the LPAR is busy, then you could set the LPAR to capped, which would limit the LPAR to 1.2 processing units. Alternatively, uncapped would allow the LPAR to use up to three processing units, since it has three virtual processors.

To change the range of spare processing units that can be utilized, use the HMC to change desired virtual processors to a new value between the minimum and maximum settings. To change the guaranteed processing units, use the HMC to change desired processing units to a new value between the minimum and maximum settings.
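
From the HMC command line, the same dynamic changes can be made with chhwres. A hedged sketch, using a made-up managed system name (p570) and LPAR name (prod1); substitute your own:

chhwres -r proc -m p570 -o a -p prod1 --procunits 0.4   # add 0.4 processing units to the running LPAR
chhwres -r proc -m p570 -o a -p prod1 --procs 1         # add one virtual processor
chhwres -r proc -m p570 -o r -p prod1 --procs 1         # remove one virtual processor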

When you think about processors, you need to think P-V-L (physical, virtual, logical). The physical CPUs are the hardware on the frame. The virtual CPUs are set up in the HMC when we decide how many virtual CPUs to give to an LPAR. The logical CPUs are visible and enabled when we turn on SMT.
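
A few AIX commands make the P-V-L layers visible from inside an LPAR; this is just a quick reference, and the counts you see will depend on your configuration:

lparstat -i           # entitled capacity, online virtual CPUs and the size of the shared pool
lsdev -Cc processor   # the virtual (or dedicated) processors assigned to the LPAR: proc0, proc2, ...
bindprocessor -q      # the logical CPUs AIX dispatches threads on (virtual CPUs x SMT threads per core)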

When configuring an LPAR, Cler recommends setting the desired processing units to cover a major portion of the workload, then set desired virtual processors to match the peak workload. LPAR-CPU utilization greater than 100 percent is a good thing in a shared pool, as you’re using spare cycles. When you measure utilization, do it at the frame level so you can see what all of the LPARs are doing.

There’s a great deal to understand when it comes to Power Systems and the flexibility that you have when you set up LPARs. Without a clear understanding of how things relate to each other, it’s very easy to set things up incorrectly, which might result in performance that doesn’t meet your expectations. However, by using dynamic logical-partitioning operations, it can be easy to make changes to running LPARs, assuming you have good minimum and maximum values. As one of my colleagues says, “These machines are very forgiving, as long as we take a little care when we initially set them up.”

Other Resources

IBM developerWorks
Virtualization Concepts

IBM Redbooks publications
“PowerVM Virtualization on IBM System p: Introduction and Configuration Fourth Edition” (SG24-7940-03)

“IBM PowerVM Virtualization Managing and Monitoring” (SG24-7590)

IBM Systems Magazine articles
Mapping Virtualized Systems

Shared-Processor Pools Enable Consolidation

My Love Affair with IBM i and AIX

Edit: Another good one.

Originally posted December 1, 2008 by IBM Systems Magazine

I started my IT career in 1988 as a computer operator, specializing in AS/400 servers running OS/400. It was love at first sight. The commands made sense, I learned to love WRKACTJOB and WRKJOBQ and QPRINT and QBATCH. I’d happily vary on users who’d tried to log onto their green-screen terminal too many times with the wrong password. I’d configure my alphanumeric pager to send me operator messages that were waiting for a reply. Printing reports on green bar paper and changing to different print forms became an art. You had to know where to line up the paper, and when to hit G on the console. You had to manage the backup tapes and the backup jobs. For the most part, interactive response time took care of itself, although occasionally I’d have to hold a batch job during the day if things weren’t running smoothly.

You can call it i5/OS, you can call it IBM i, you can call it whatever you want, but to me it will always be OS/400. Among many people I talk to, the feeling is the same. Name it what you want, just keep supporting and selling it, because we love it.

I worked for three different companies doing AS/400 computer operations, and the OS/400 learning curve wasn’t very steep whenever I made a change. The simplicity and the elegance of the interface was the same. The computers just worked. Sure, the machines ran different applications since the companies were in different industries. In some cases, they were in different countries. The green screens looked the same no matter where I worked. I can remember hardware issues where we would lose a 9336 disk, replace it, and the machine would keep on running. I can remember human error causing issues; however, I don’t remember the operating system locking up like others have been known to do. I can’t remember wishing I were running something else. OS/400 was and is a rock-solid platform on which to run a business.

My head was turned in 1998, and I left my first love and started my affair with AIX. I traded QSECOFR for root. There’s much to be said for AIX and open systems. I also like the way things are structured in this operating system. It can seem familiar to people with Solaris or Linux skills, although there will be new things to learn, like the Object Data Manager (ODM) and the System Management Interface Tool (SMIT, better known as smitty). A friend likes to dismiss AIX by calling it “playing with tinker toys.” I can connect the operating system to all kinds of disk subsystems from all kinds of manufacturers. I can use third-party equipment to manage my remote terminal connections if I want to. I can run all kinds of applications from all kinds of vendors. Since it lives in the UNIX world, its heritage is considered to be more open and less proprietary, although I’m sure that open-source adherents and members of the Free Software Foundation would argue that point.

I’ve become accustomed to things taking a certain amount of tinkering to get them to work. I know that I may have to load some drivers, or configure a file in the /etc directory to tell a program how to behave. I have to pay attention to disk consumption, file system sizes, volume groups, etc. I accept all of that as part of the whole package. Some from the IBM i world hear about this and shake their heads and wonder why anyone would put up with it.

Now that POWER servers have been consolidated and AIX and IBM i will run on the same machine, it makes sense to see what can be shared. How can we take our current AIX and IBM i environments and run them all on the same physical frame? During this exploration, I’ve been hearing a great deal of resistance from the i community. Part of this might be a natural response to any kind of change. Change can be scary and painful. However, since I’ve spent a bit of time in both the AIX and IBM i worlds, I think I can safely say it shouldn’t be scary and it definitely isn’t painful. It’s just another set of commands to learn, but once you learn them, it’s just like anything else in IT.

I’ve begun playing with IBM i again and it’s like I never left. I’ve written an article on implementing IBM i using Virtual I/O Servers (VIOS). If you’re an IBM i administrator, the idea of running IBM i as a client of VIOS might sound intimidating, but it’s not. In years past, IBM i has hosted AIX and Linux partitions. Using VIOS is the exact same concept, only instead of your underlying operating system being IBM i based, it’s VIOS, which is AIX based. If you want to know why you should bother, check out “Running IBM i and AIX in the Same Physical Frame,” then let me know what you think. I still look back fondly at my first true love, and I’m glad it’s still being well positioned for the future.

If you’re an AIX administrator, offer help to IBM i administrators who might be nervous about running VIOS to connect to external disk. In some larger shops, these teams might not spend much time together, but it’s time to change that mentality.

Run IBM i and AIX in the Same Physical Frame

Edit: Some links no longer work.

POWER technology-based servers allow for consolidation

Originally posted December 2008 by IBM Systems Magazine

As I wrote in my blog titled “My Love Affair with IBM i and AIX”, I started my career working on AS/400 servers running OS/400 – and I loved it. Then I started working on AIX – and I loved that. AIX has been my world for the past decade.

During that time, AIX customers began using Virtual I/O Servers (VIOS) to consolidate the number of adapters needed on a machine. Instead of 10 standalone AIX servers they would have one or two frames and use pools of shared processors. But all these LPARs needed dedicated hardware adapters. So they consolidated again to share these resources. This required a VIOS with shared network adapters and shared Fibre adapters, which reduced the number of required physical adapters.

Now that POWER technology-based servers have been consolidated and AIX and IBM i run on the same machine, it makes sense to see what else can be shared. How can we take our current AIX and IBM i environments and run them on the same physical frame?

The IBM Technical University held this fall in Chicago offered sessions for AIX customers and for IBM i customers. If you were like me, you bounced between them. Great classes went over the pros and cons of situations where IBM i using VIOS as a solution may make sense. Although the idea of running IBM i as a client of VIOS might sound intimidating, it’s not. In years past, IBM i has hosted AIX and Linux partitions. Using VIOS is the same concept, only instead of your underlying operating system being IBM i-based, it’s VIOS, which is AIX-based.

Great documentation has been created to help us understand how to implement IBM i on VIOS. Some is written more specifically for those running IBM i on blades, but it’s applicable whether you’re on a blade or another Power Systems server. Many shops already have AIX skills in house, but if you don’t, it can be very cost-effective to hire a consultant to do your VIOS installation. Many customers already bring in consultants when they upgrade or install new hardware, so setting up VIOS can be something to add to the checklist. You can also opt to have IBM manufacturing preinstall VIOS on Power blades or Power servers.

Answering the Whys

Why would you want to use VIOS to host your disk in the first place? VIOS is able to see more disk subsystems than IBM i can see natively. As of Nov. 21, the IBM DS3400, DS4700, DS4800, DS8100 and DS8300 are all supported when running IBM i and VIOS, and I expect the number of supported disk subsystems to increase. You can also use a SAN Volume Controller (SVC) with VIOS, which lets you put many more storage subsystems behind it–including disk from IBM, EMC, Hitachi, Sun, HP, NetApp and more. This way you can leverage your existing storage-area network (SAN) environment and let IBM i connect to your SAN.

The question remains, why bother with VIOS in the first place? These open-system disk units are expecting to use 512 bytes per sector, while traditional IBM i disk units use 520 bytes per sector. By using VIOS, you’re presenting virtual SCSI disks to your client LPARs (vtscsi devices) that are 512 bytes per sector. IBM i’s virtual I/O driver can use 512 bytes per sector, while none of the current Fibre Channel, SAS or SCSI drivers for physical I/O adapters can (for now). IBM i storage management will expect to see 520 bytes per sector. To get around that, IBM i uses an extra sector for every 4 K memory page. The actual physical disk I/O is being handled by VIOS, which can talk 512 bytes per sector. This, in turn, allows you to widen the supported number of disk subsystems IBM i can use without forcing the disk subsystems to support 520 bytes per sector.

But again, why bother? It’s certainly possible you don’t need to implement this in your environment. If things are running fine, this makes no sense for you. This solution is another tool in the toolbox, and another method you can use to talk to disk. As alternative solutions are discussed in your business, and people are weighing the pros and cons of each, it’s good to know VIOS is an option.

Do you currently have a SAN or are you looking at one? Are you thinking about consolidating storage for the other servers in your environment? Are you considering blade technology? Are you interested in running your Windows, VMware, Linux, AIX and IBM i servers in the same BladeCenter chassis? If you have an existing SAN, or you’re thinking of getting one, it may make sense to connect your IBM i server to it. If you’re thinking of running IBM i on a blade, then you most certainly have to look at a SAN solution. These are all important ideas to consider, and you may find significant savings when you implement these new technologies.

VIOS

When I was first learning about VIOS, a friend of mine said this was the command I needed to share disk in VIOS: mkvdev -vdev hdisk1 -vadapter vhost1

When you think of an IBM i command, mkvdev (or make virtual device) makes perfect sense. I find that to be true of many AIX commands. You give the command the disk name (in this case an hdisk known to the machine as hdisk1) and the adapter to connect it to. On the IBM i client partition, a disk will appear that’s available for use just like any other disk.

To take it from the beginning, you’d have already set up your server and client virtual adapters, and your SAN administrator would zone the disks to your VIOS physical Fibre adapters. You’d log into VIOS as padmin, and after you run cfgdev in VIOS to make your new disks available, you can run lspv (list physical volume) and see a list of disks attached to VIOS.

In my case I see:

lspv
NAME       PVID               VG       STATUS
hdisk0     0000bb8a6b216a5d   rootvg   active
hdisk1     00004daa45e9f5d1   None
hdisk2     00004daa45ebbd54   None
hdisk3     00004daa45ffe3fd   None
hdisk4     00004daa45ffe58b   None
hdisk5     00004daae6192722   None

This might look like only one disk, hdisk0 in rootvg, is in use. However, if I run lsmap -vadapter vhost3 (lsmap could be thought of as list map, with the option asking it to show me the virtual adapter called vhost3), I’ll see:

SVSA              Physloc                          Client Partition ID
----------------- -------------------------------- -------------------
vhost3            U7998.61X.100BB8A-V1-C17         0x00000004

VTD               vtscsi3
Status            Available
LUN               0x8100000000000000
Backing device    hdisk5
Physloc           U78A5.001.WIH0A68-P1-C6-T2-W5005076801202FFF-L9000000000000

This tells me that hdisk5 is the backing device, and it’s mapped to vhost3, which in turn is mapped to client partition 4, which is the partition running IBM i on my machine.

To make this mapping, I needed to run the mkvdev command:

mkvdev -vdev hdisk5 -vadapter vhost3

If I needed to assign more disks to the partition, I could’ve run more mkvdev commands. At this point, I use the disks just as I would any other disks in IBM i.
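
If I were mapping several disks to the same IBM i client, the flow would look something like the sketch below. The hdisk6 and vtscsi4 names are hypothetical; the rest are from my example:

cfgdev                                               # discover the newly zoned LUNs on the VIOS
lspv                                                 # confirm the new hdisks are visible
mkvdev -vdev hdisk6 -vadapter vhost3 -dev vtscsi4    # map another disk, naming the virtual target device
lsmap -vadapter vhost3                               # verify every backing device now mapped to this client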

It might look like gibberish if this is your first exposure to VIOS. Your first inclination may be to avoid learning about it. Don’t dismiss it too quickly. IBM i now has another option when you’re setting up disk subsystems. The more you know about how it works, the better you’ll be able to discuss it.

Although I may find myself more heavily involved with AIX and VIOS, I still look back fondly at my first true love, and I’m glad it’s still getting options added that position it well for the future.

References

www.ibm.com/systems/resources/systems_power_hardware_blades_i_on_blade_readme.pdf

www.ibm.com/systems/resources/systems_i_os_i_virtualization_and_ds4000_readme.pdf

www.redbooks.ibm.com/abstracts/sg246455.html

www.redbooks.ibm.com/abstracts/sg246388.html

www.ibm.com/systems/storage/software/virtualization/svc

Data Protection Versus Risk

Edit: I remember writing this in the airport. Some links no longer work.

Originally posted August 2008 by IBM Systems Magazine

I just took off my shoes, took out my laptop, removed the liquids from my carry-on bags and the metal from my pockets. I walked through a metal detector and my belongings filed through the X-ray machine. This was all done in the name of airport security.

It was inconvenient. It took time to navigate through the lines, show my documents and eventually clear security. It took money to implement and maintain all of the systems in place. However, the inconvenience, lost time and money spent was offset by the idea of keeping attackers out of the system. In fact, the knowledge that these defenses were in place potentially kept many threats at bay.

After all of this, the secure area could still be attacked by determined individuals and organizations. A trusted employee could cause harm. Someone who passed a background check could then turn around and do harm. X-rays and metal detectors are deterrents, but they can’t guarantee that nothing bad will ever happen.

As we all prefer the convenience and time savings we gain when we fly, decisions are made as to acceptable risk and potential inconvenience to travelers. People learn the new security rules and follow them. Airport security and server security involve protecting different things, yet have similar goals. Instead of protecting planes, you want to protect data, keep it on your server safe from attackers and limit system access to authorized personnel.

Planning Network Security

The only secure server is one that’s turned off. After you hit the power button, you have to start trusting people. When you power it on and let people have access, you run the risk of compromised security. You must trust the people that work on the servers. You should decide what activities you’re trying to prevent. Are lives at stake if medical data is updated? Is privacy and financial harm to your customers an issue if Social-Security numbers or credit-card numbers are disclosed? Are trade secrets and confidential business plans at risk if someone has access to sensitive information?

As you think about securing your machines, think about network security, physical security and user security. I’m attempting to get you to think about what you’re doing right, wrong and what you might need to change. This isn’t an all-inclusive list—threats change and evolve, and specifics change—but the basic concepts remain the same.

See the Redbooks* publication, “Understanding IT Perimeter Security” for more information on this topic (www.redbooks.ibm.com/redpapers/abstracts/redp4397.html?Open).

We usually put our machines behind firewalls. In some environments, firewalls aren’t enough, and absolutely no network activity is allowed to the public Internet. Some companies choose to implement different network layers and segregate which machines access which networks. In some environments network traffic isn’t allowed to leave the computer room—you have to use a secure terminal in a secure area to access data. Again, you must weigh what you’re trying to accomplish, from whom you’re trying to protect yourself and what harm will be done if the data you’re trying to secure is compromised.

I once heard about a phone call from a customer to a service provider who was hosting his servers. The customer asked the provider how secure his physical servers were and was told that they were in a raised-floor environment that required a keycard to access. The customer replied that he was standing in front of the servers on the raised floor and he wasn’t happy. Dressed in his normal clothes—not as a maintenance man pretending to do work on the air conditioning unit—the customer gained entry when a friendly, helpful authorized person held the door to the raised floor open for him.

This event caused mantraps to be installed. To access the raised floor, you had to scan your fingerprint and your keycard, and then enter the mantrap one employee at a time. It caused more pain when many people needed to access the raised floor at once, but it was determined that this pain was offset by the gains in knowing exactly who was on the raised floor, and removing the possibility of someone letting any unauthorized people onto the floor. If an attacker gains physical access to the machine, then it’s game over; physically securing the machines is critical.

Tightening Security

Many data centers have locked cages around the servers. In the aforementioned scenario, even if a helpful employee helped you access the raised floor, you’d still need keys to the cages to work on the machines.

This isn’t limited to a raised-floor environment. Once I have access to a desktop machine I can add a small connector between the keyboard and the machine to log keystrokes and capture passwords. I can copy data onto a USB thumb drive. I can boot the OS into maintenance mode and make changes to allow me to access the machine in the future, or change the root password. All I need is a part-time job with a cleaning crew and many machines are vulnerable to attack.

Many people still use Telnet and FTP to access their machines. Both of these programs send their traffic unencrypted over the network. If I trace the network traffic on my machine I can easily capture cleartext passwords. I’d make it a priority to convert to SSH/SCP/SFTP so that the network traffic was encrypted.
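
On AIX, telnet and ftp are typically started from inetd, so turning them off once everyone is on SSH is a small change. A minimal sketch:

vi /etc/inetd.conf    # comment out the telnet and ftp lines
refresh -s inetd      # tell the SRC to reread inetd.conf
lssrc -ls inetd       # confirm telnet and ftp no longer show up as active subservers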

SSH has its own problems. Many people like to set up public/private keys and allow themselves to access their servers without passwords. It can be convenient to use one master workstation to connect to all of the machines. By setting up public/private keys you may easily create wonderful tools that allow you to make changes across your environment instead of logging in to each machine individually. If you choose to do this, be sure to protect your private keys. If I steal your keys, I can log on as you. It would be better to create a passphrase and then use SSH-agent instead of having no passphrase at all. Again, you have to weigh the risks versus the benefits. See “References,” below.
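
Here is a hedged example of the setup being described, generating a key that is protected by a passphrase and then caching it with ssh-agent for the session (the user and host names are made up):

ssh-keygen -t rsa       # create the key pair; enter a passphrase when prompted
eval $(ssh-agent)       # start an agent for this shell session
ssh-add ~/.ssh/id_rsa   # unlock the key once; the agent supplies it for later connections
ssh admin@server1       # connects without retyping the passphrase while the agent is running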

On the public Internet, the number of attacks logged against port 22 has been rising. If your SSHD is listening on the public Internet, it might be worth changing the port it’s listening on. This will keep some automated scripts from attacking you, since you won’t be listening to the port that they expect you to, but this is also offset by the pain of notifying everyone that needs to use this newly changed port. This won’t help if the attacker port scans you and finds the newly assigned port, but may help defend against automated tools and unsophisticated attackers.
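
If you do decide to move sshd off port 22, the change is one line in the server config plus a restart. A sketch, assuming sshd is registered with the AIX SRC and using 2222 purely as an example port:

vi /etc/ssh/sshd_config     # change the Port directive, e.g. Port 2222
stopsrc -s sshd             # stop the daemon
startsrc -s sshd            # start it again, now listening on the new port
ssh -p 2222 admin@server1   # clients have to specify the port explicitly from now on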

Final Tips

Run netstat -an | more on your machine and look at all the ports that are listening. Do you know what every program is? If not, find out what starts that process and turn it off. Check /etc/inetd.conf, /etc/inittab, /etc/rc.tcpip, etc., and turn off unneeded services. You can’t connect to a machine that isn’t listening.
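
A quick way to take that inventory (the services you find will obviously be specific to your machine):

netstat -an | grep LISTEN       # every port something is listening on
grep -v "^#" /etc/inetd.conf    # the inetd services that are still enabled
lssrc -a | grep active          # the SRC subsystems that are currently running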

Verify your path is set correctly and think before you type. I’ve heard of attackers that would change root’s path to have “.” at the beginning, which caused it to execute whatever was in the local directory first. Then they had to add a script or program to some directory and get root to run it. Depending on the administrator’s skill level, they might not even know that they just gave away root access to the machine.
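
Checking for that particular problem takes one line; a leading dot or an empty field in root’s PATH is the red flag:

echo $PATH | tr ':' '\n'    # print each PATH entry on its own line and look for "." or a blank entry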

Secure, Workable System

Monitoring logs, hardening machines and continually maintaining a secure posture can be painful for both system administrators and for users. You want to keep the data relatively easy to access for those whose jobs demand it, while keeping others out. In both scenarios, the goal is a secure workable system, either in the air or on the raised floor.

References

www.lbl.gov/cyber/systems/ssh.html
www.securityfocus.com/infocus/1806
www.cs.uchicago.edu/info/services/ssh

Analyzing Live Partition Mobility

Edit: This is taken for granted now. Some links no longer work.

While I was in Austin, one of the things that IBM demonstrated was how you can move workloads from one machine to another. IBM calls this Live Partition Mobility. I saw it in action and I went from skeptic to believer in a matter of minutes.

Originally posted November 2007 by IBM Systems Magazine

I was in the Executive Briefing Center in Austin, Texas, recently for a technical briefing. It’s a beautiful facility, and if you can justify the time away from the office, I highly recommend scheduling some time with them in order to learn more about the latest offerings from IBM. From their Web site:

“The IBM Executive Briefing Center in Austin, Texas, is a showcase for IBM System p server hardware and software offerings. Our main mission is to assist IBM customers and their marketing teams in learning about new IBM System p and IBM System Storage products and services. We provide tailored customer briefings and specialized marketing events.

“Customers from all over the world come to the Austin IBM Executive Briefing Center for the latest information on the IBM UNIX-based offerings. Here they can learn about the latest developments on the IBM System p and AIX 5L, the role of Linux and how to take advantage of the strengths of our various UNIX-capable IBM systems as they deploy mission-critical applications. Companies interested in On Demand Business capabilities also find IBM System p offers some of the most advanced self-management features for UNIX servers on the market today.”

While I was in Austin, one of the things that IBM demonstrated was how you can move workloads from one machine to another. IBM calls this Live Partition Mobility.

I saw it in action and I went from skeptic to believer in a matter of minutes. At the beginning, I kept saying things like, “This whole operation will take forever.” “The end users are going to see a disruption.” “There has to be some pain involved with this solution.” Then they ran the demo.

The presenters had two POWER6 System p 570 machines connected to the hardware-management console (HMC). They started a tool that simulated a workload on one of the machines. They kicked off the partition-mobility process. It was fast, and it was seamless. The workload moved from the source frame to the target frame. Then they showed how they could move it from the target frame back to the original source frame. They said they could move that partition back and forth all day long. (Ask your business partner or IBM sales representative to see a copy of the demo. There’s a Flash-based recording that was made to show it to customers. I’m still waiting for it to show up on YouTube.)

The only pain that I can see with this solution is that the entire partition that you want to move must be virtualized. You must use a virtual I/O (VIO) server and boot your partition from shared disk that’s presented by that VIO server, typically a storage-area network (SAN) logical unit number (LUN). You must use a shared Ethernet adapter. All of your storage must be virtualized and shared between the VIO servers. Both machines must be on the same subnet and share the same HMC. You also must be running on the new POWER6 hardware with a supported OS.

Once you get everything set up, and hit the button to move the partition, it all goes pretty quickly. Since it’s going to move a ton of data over the network (it has to copy a running partition from one frame to another), they suggest that you be running on Gigabit Ethernet and not 100 Megabit Ethernet.

I can think of a few scenarios where this capability would be useful:

The next time errpt shows me I have a sysplanar error. I call support and they confirm that we have to replace a part (which usually requires a system power down). I just schedule the CE to come do the work during the day. Assuming I have my virtualization in place and a suitable machine to move my workload to, I just move my partition over to the other hardware while the repair is being carried out. No calling around the business asking for maintenance windows. No doing repairs at 1 a.m. on a Sunday. We can now do the work whenever we want as the business will see no disruption at all.

Maybe I can run my workload just fine for most of the time on a smaller machine, but at certain times (i.e., month end), I’d rather run the application on a faster processor or a beefier machine that’s sitting in the computer room. Move the partition over to finish running a large month-end job, then move it back when the processing completes.

Maybe it’s time to upgrade your hardware. Bring in your new machine, set up your VIO server, move the partition to your new hardware and decommission your old hardware. Your business won’t even know what happened, but will wonder why the response time is so much better.

What happens if you’re trying to move a partition and your target machine blows up? If the workload hasn’t completely moved, the operation aborts and you continue running on your source machine.

This technology isn’t a substitute for High Availability Cluster Multi-Processing (HACMP) or any kind of disaster-recovery situation. This entire operation assumes both machines are up and running, and resources are available on your target machine to handle your partition’s needs. Planning will be required.

This will be a tool that I will be very happy to recommend to customers.

Tips for Gaining Practical Systems Administrator Knowledge

Edit: Winning this contest opened doors for me.

When an individual seeks additional experience, the whole company may benefit as a result.

Originally posted April 2007 by IBM Systems Magazine

Note: As part of a collaboration between PowerAIX.org and IBM Systems Magazine, guest writers were invited to submit Tips & Techniques articles to be considered for publication. A panel decided Rob McNelly’s column, seen here, best met the contest’s criteria.

I work with an intern. He goes to school and comes to the office when he’s not studying for tests, working on homework or going to class. It’s fair to say that we subject him to some good-natured abuse. For example, we send him to the computer room to look up serial numbers or to verify that network cables are plugged in. When I ask him why he puts up with it, he tells me he’s grateful for the opportunity and will happily do anything in order to gain experience.

How else do entry-level people get their start in the industry? When I look at job listings I see plenty of opportunities for senior-level administrators with years of experience. I don’t see the same opportunities for novices that I once did. There seem to be fewer openings for people to start out on a help desk or in operations and then move their way up. It still happens, but many of those lower-skilled jobs are now being handled remotely from overseas.

Seek Practical Experience

Besides working as a paid intern, another method I’ve seen people use to gain practical experience is to get some older hardware from eBay and use that as their test lab. You don’t need the latest System p5 595 server to learn how to get around the AIX OS. An old machine might be slower, but it’s just fine to practice loading patches, getting used to working with the logical volume manager and learning the differences between AIX and other flavors of UNIX. Mixing older RS/6000 machines with some older PCs running Linux can give anyone a good understanding and exposure to UNIX without actually learning it on the job or spending a great deal of money.

People can also download Redbooks from IBM, and use those study guides to learn more. If they then get involved with a local user group, they can make connections with people who are usually willing to share their knowledge. Eventually, they have some basic knowledge, and can hope to land a position as a junior-level administrator.

This initiative to learn outside of work hours can prove invaluable. I know that if I interview someone who tells me he has little hands-on experience working in a large datacenter, but he’s shown that he’s ambitious enough to study and learn what he can on his own, I’m willing to take the chance that I can teach him the finer points of what he needs to know to do the job. Give me someone with a good attitude and a desire to learn, and he or she can usually be taught what’s necessary to be productive in my environment.

Senior-level administrators can give back by writing articles and answering questions. Personally, I’ve found some IRC channels and some Usenet groups that I respond to if I have time. If we want more people to learn about the benefits of using the AIX OS, then we should be willing to help them when they run into problems. Many people run Linux because it’s relatively easy to obtain and install. They know that they can go online and easily get help when they run into problems. That same type of community should be encouraged around the AIX world as well. The following are some tips that I’ve found helpful. Hopefully our intern finds them helpful as well.

Migrating Machines

When migrating machines to new hardware in the good old days, I would make a mksysb tape and take that tape over to the new server I was building. I would boot from the AIX CD, select that mksysb image and restore it to my new machine. As time goes on, I find it less common to see newer hardware equipped with tape drives. Much of my server cloning these days occurs using the network. Two tools that I rely on are NIM and Storix. I create my mksysb and move it to my NIM server or use the Storix Network Administrator GUI and create a backup image of my machine to my backup server. In either case, I just boot the machine that I want to overwrite, set up the correct network settings and install the image over the network. This can be a problem in a disaster-recovery situation if you haven’t made sure that these backup images are available offsite, but for day-to-day system imaging I’ve found both methods to be useful.
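
As a hedged sketch of the NIM path described here (the file locations, resource names and client name are made up for the example):

mksysb -i /export/mksysb/server1.mksysb                        # on the source machine, create the rootvg image
nim -o define -t mksysb -a server=master \
    -a location=/export/mksysb/server1.mksysb server1_mksysb   # on the NIM master, register it as a resource
nim -o bos_inst -a source=mksysb -a mksysb=server1_mksysb \
    -a spot=aix_spot -a accept_licenses=yes new_server1        # push the image to the already-defined NIM client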

Sorting Slots

I know that some people have issues when looking at the back of a p595 server. It can be a chore when you want to know which slot is which. This can be important when creating several LPARs on the machine. You want to keep track of which Fibre card and which network card goes with which LPAR. Anything looks complicated until someone shows you how it works.

First, find the serial numbers of the drawers on the machine, as this is the information that’s displayed on the hardware-management console (HMC), and we’re trying to correlate each physical slot with what the HMC shows. I used a flashlight and looked at the left side of the front of my example machine. It has two drawers, in this case 9920C6A and 9920C7V.

When you go to the back of the machine, start counting your I/O cards from left to right. There will be four cards, a card you ignore, then six cards. These will be your first 10 slots on the drawer’s left side. There are four more cards, a card you ignore, and six more cards, making up the 10 slots on the drawer’s right side.

These slot numbers correspond with the slots you see when you select required and desired I/O components from the HMC. This I/O drawer had the following selections that I could choose from on the HMC (P1 is the left side of the top drawer, or the first 10 slots. P2 is the right side of the top drawer, or the second 10 slots.):

  • 9920C6A-P1
  • 9920C6A-P2

When I looked at the drawer, going from left to right, I wrote down:

  • C01 is an Ethernet card
  • C02 is a Fibre card
  • C03 and C04 are empty
  • C05 is a Fibre card
  • C06 is an Ethernet card
  • C07 is empty
  • C08 is a Fibre card
  • C09 is empty
  • C10 is a SCSI bus controller

So, I assigned C01, C02 and C05 from 9920C6A-P1 to this LPAR. If I continue the exercise and go to the right side of the top drawer, I start over with C01 and note which type of cards were in which slot. I then continue to do the same thing on the bottom drawer. In this way, I know exactly which cards are in which slot, and it’s simple to assign them to the particular LPAR in which you want them. For redundancy, I’ve heard recommendations state that you take one Fibre card from your top drawer and another Fibre card from your bottom drawer. This way you will still have a path to the SAN if you were to lose one of the drawers.
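
If an adapter is already assigned to a running AIX LPAR, you can also double-check your physical notes against what the operating system reports, since lscfg prints the drawer and slot in the location code. A small example (adapter names will differ on your system):

lscfg -vl fcs0    # physical location code (drawer, planar and slot) for the first Fibre Channel adapter
lscfg -vl ent0    # same idea for the first Ethernet adapter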

A Fresh Perspective

Another thing I like to do when I bring in new employees is to have them look for what we’re doing wrong. New employees have fresh eyes. They don’t know that “this is the way we always do things around here.” They see a piece of documentation, a tool or a process, and can question why things are done the way they are.

In some cases, there are perfectly good reasons why things are being done a certain way and you can explain them. In other cases, there’s no good reason, other than it’s the way things have always been done. Instead of trying to get them up to speed and make them do it the company way, let them ask you to defend why the company does things this way in the first place. Maybe their previous employer had a much better method that they used to get things done. This is a great time to learn from each other to improve the environment.

We can try to help make a difference in a newer AIX administrator’s career. However, that doesn’t mean we’re the fountain of all information. I’ve found a time or two when an intern has asked why we do things a certain way, and I didn’t have a good answer. I told him to figure out a better way, and come back and inform the group. This helps him with his knowledge of where to look for information and it has helped us all think about processes and procedures that we’ve taken for granted.

The intern’s learning is an example of the boon a little practical knowledge can make. When an individual seeks additional experience, the whole company may benefit as a result.

Establishing Good Server Build Standards, Continued

Edit: Still useful information.

Standards and checklists can take effort to maintain but, once in place, all of your builds look identical.

Originally posted January 2007 by IBM Systems Magazine

Note: This is the second of a two-part article series. The first part appeared in the December, 2006 EXTRA.

In my first part of this article series, I explained the importance of establishing good server build standards, along with a mechanism to enforce those standards. I also explained the importance of putting in place a checklist to ensure the standards are met in a consistent manner. This second article installment looks further into server build standards.

The Benefits of a Good Server Build

Standards and checklists can take a great deal of effort to maintain but, once in place, all of your builds look identical. The actual time it takes to deploy a server is minimized. Administrators are then free to work on other production issues instead of spending a great deal of time loading machines. When you’re only deploying a server once in a while, this might not be a big deal. But when you start to deploy hundreds of machines, your application support teams are going to appreciate your consistency. If two administrators are building machines – and each machine looks slightly different – your end users won’t know exactly what to expect when a machine gets turned over to them. They spend time asking for corrections and additions to the machine that should have been completed when the server was loaded. This makes your team’s work product look shoddy, as you’re not consistently delivering the same end product.

People who use the machines should come to expect the new server builds will have the same tools and settings as all the other machines in the environment. When the database team is ready to load their software, and find file systems or userids missing, it makes their job more difficult, as they’re not sure what to expect when they first get access to the machine.

Be Consistent

Besides the server builds, the actual carving up of LPARs should be consistent. Sure, some machines might be using “capacity on demand,” and some might want to run capped or uncapped, but when these decisions are made, document them so people know what to expect in the different profiles. If you explain why you chose the setting, people are less likely to change it. Likewise, if you tell them why you chose shared processors and why the minimum and maximum number of processors look the way they do, they’ll be less likely to mess with it.
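
One low-effort way to capture those decisions is to dump the profiles straight from the HMC command line and keep the output with your documentation. A sketch, with a made-up managed system name and only a subset of the available fields:

lssyscfg -r prof -m p570 -F name,lpar_name,proc_mode,sharing_mode,min_procs,desired_procs,max_procs,min_proc_units,desired_proc_units,max_proc_units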

So how do you get people to actually follow the standards and documentation you’ve been maintaining? Make sure it’s easy to follow. The new person who joins your team should be able to quickly get up to speed on what you’re doing and why. This will make them a more effective member of the team in less time. When people make mistakes, or blatantly ignore the standards, call them out on it; maybe privately at first, but if it continues, I think the whole team should be made aware that a problem exists. Maybe there’s a good reason the standards aren’t being followed. Maybe they’ve been to class and learned something new. If this is the case, there should be some discussion and consensus as to how the documentation and standards should change.

The documentation your team maintains should obviously focus on more than just server builds. The more quality procedures and documentation you can create, the easier your job is going to be over the long term. If you have a well written procedure, you can easily remind yourself of what you did six months ago, and which files you need to change and which commands you need to run to make changes to the system today.

Some members of the team have stronger documentation skills than others. Some may have very strong technical skills, but their writing skills may not be as strong. This shouldn’t automatically get people off the hook, but if they really don’t have a good grasp on the language, or just have problems getting documentation onto paper, maybe they need to work together with someone who has more skill in that area. Maybe there needs to be a dedicated resource that works on creating and maintaining the documentation. Obviously every team will be different. The key to making effective use of documentation is to make it easily available (especially when on call or working remotely) and easily searchable.

When you are able to quickly and easily search for documentation, and everyone knows exactly where it is, it is more apt to be used. Instead of reinventing the wheel, people should be able to quickly find the material they need to do their jobs. In some cases, a very brief listing of necessary commands may be very helpful in troubleshooting a problem. It’s also helpful to have a good overview of common problems, how things should behave normally, and where to go for further information if they’re still having problems.

Once the documentation and the golden image are in place, your team can start looking for other ways to automate and enhance the environment they work in. There are always better ways to do things. Just because something is the way things are done today doesn’t mean it’s the best way to get things done. With an open mind, and a fresh set of eyes, sometimes we can more easily see the things around us that could use improvement. Then it’s just a question of making the time to make things happen. Sometimes it requires small steps, but with a clear vision of how things should look, we can make the necessary adjustments to make things better.

Establishing Good Server Build Standards

Edit: This can be less of an issue when things are more automated, but it is still worth consideration.

Server build standards simplify the process of supporting IT environments.

Originally posted December 2006 by IBM Systems Magazine

Note: This is the first of a two-part article series. The second part will appear in the January EXTRA.

There are still small organizations with one or two full time IT professionals working for them. They may find they are able to make things work with a minimum of documentation or procedures. Their environment may be small enough that they can keep it all in their heads with no real need for formal documentation or procedures. As they continue to grow, however, they may find that formal processes will help them, as well as the additional staff that they bring on board. Eventually, they may grow to a point where this documentation is a must.

The other day I was shutting down a logical partition (LPAR) that a co-worker had created on a POWER5 machine. A member of the application support team had requested we shut down the LPAR as some changes had been made, and they wanted to verify everything would come up cleanly and automatically after a reboot. We decided to take advantage of the outage and change a setting in the profile and restart it. To our surprise, after the LPAR finished its shutdown, the whole frame powered off. When you go into the HMC and right-click on the system, and select properties, you see the managed system property dialog. On the general tab, there’s a check-box that tells the machine to power off the frame after all logical partitions are powered off. During the initial build, this setting was selected, and our quick reboot turned into a much longer affair as the whole frame had to power back up before we were able to activate our partition. This profile setting had not been communicated to anyone, and we had mistakenly assumed it was set up like the other machines in our environment.

This scenario could have been avoided had there been good server build standards in place, along with a mechanism to enforce those standards. Our problem wasn’t that the option was selected, but that there was no good documentation in place that specified exactly what each setting should look like and why. Someone saw a setting and made their best guess as to what that setting should be, and then that decision was not communicated to the rest of the team. One of the problems with having a large team is people can make decisions like these without letting others know what has taken place. Unless they have told other people what they’re doing, other members of the team might assume the machine will behave one way when, in actuality, it’s been set up another way.
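
These days you can also audit a setting like that from the HMC command line instead of clicking through every frame. A hedged example; I believe the attribute is called power_off_policy, but verify the name on your HMC, and the managed system name here is made up:

lssyscfg -r sys -F name,power_off_policy          # a value of 1 means the frame powers off when the last LPAR stops
chsyscfg -r sys -m p570 -i "power_off_policy=0"   # change it on a managed system named p570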

Making A List, and Checking It Twice

Checklists and documentation are great, as long as people are actually doing what they are supposed to. Some shops have a senior administrator write the checklist, and a junior administrator build the machine, while another verifies the build was done correctly. A problem can crop up when a senior administrator asks for something in the checklist without explaining the thinking behind it. He understands why he has asked for some setting to be made, or some step to be taken, but nobody else knows why it’s there. The documentation should include what needs to change, but also why it needs to be changed. If it’s clear why changes are made, people are more apt to actually follow through and make sure all the servers are consistent throughout the environment. If the answer they get is to just do it, they might be less likely to bother with it since they don’t understand it anyway. The person actually building the machine might not think it’s important to follow through on, which leads to the team thinking a server is being built one way, when the finished product does not actually look the way the team as a whole thought it would.

The team also needs to keep on top of the checklist, as this is a living document in a constant state of flux. As time goes on, if the checklist is not kept up to date, changes to the operating system and maintenance-level patches can make a setting obsolete, or the setting can start causing problems instead of fixing them. The decision may have been made to deploy new tools, change where logfiles go, or change the standard jobs that run out of cron. If these changes are not continually added to the checklist, new server builds no longer match those in production. This is equally important when decommissioning a machine. There are steps that must be taken, and other groups that need to be notified. The network team might need to reclaim network cables and ports. The SAN team may need to reclaim disk and fiber cables. The facilities team may need to know that power is no longer required on the raised floor. To put it simply: a checklist that's followed can ensure these steps get completed. Some smaller shops may not have dedicated teams for these things, in which case it might simply be a matter of reminding the administrators to take care of these steps.

Another issue can crop up when the verifier is catching problems with the new server builds, but isn’t updating the documentation to help clarify settings that need to be made. If the verifier is consistently seeing people forgetting to change a setting, they should communicate what’s happening to the whole team, why it needs to happen during the server build, and then update the documentation to more clearly explain what needs to be done during the initial server build. What’s the point of a verifier catching problems all the time, but then not making sure the documentation is updated to avoid these problems from cropping up in the future?

Having these standards makes supporting the machines much easier, as all of the machines look the same. Troubleshooting a standard build is much easier, as you know what filesystems to expect, how volume groups are set up, where the logs should be, what /etc/tunables/nextboot looks like, and so on. Building them becomes very easy, especially with the help of a golden image. I think it’s essential to have infrastructure hardware you can use to test your standard image. This hardware can be dedicated to the infrastructure or an LPAR on a frame but, in either case, you set up your standard image to look exactly as you want all of your new servers to look, and make a mksysb of it. Then use that on your NIM server to do your standard loads. Instead of building from CD, or doing a partial NIM load with manual tasks to be done after the load, keep your golden image up to date and use that instead. Keep the manual tasks that need to happen after the server build to an absolute minimum, which will keep the inconsistencies to a minimum as well. When patches come out, or new tools need to be added to your toolbox, make sure – besides making that change to the production machines – you’re updating your golden image and creating a more current mksysb.
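To make that concrete, here is a rough sketch of the capture-and-register workflow. The file paths and NIM resource names (golden_mksysb, golden_spot, newserver) are made-up examples, and your NIM setup will have its own conventions:

# On the golden-image LPAR: regenerate /image.data and write the backup to a file
mksysb -i /export/images/golden_aix.mksysb

# On the NIM master: register that file as a mksysb resource
nim -o define -t mksysb -a server=master \
    -a location=/export/images/golden_aix.mksysb golden_mksysb

# Install a new client from the golden image, using your existing SPOT
nim -o bos_inst -a source=mksysb -a mksysb=golden_mksysb \
    -a spot=golden_spot -a accept_licenses=yes newserver

Every time the standard changes, recreate the mksysb file and redefine the resource, and the next build automatically picks up the current image.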

In next month’s article, I’ll further explore the benefits of establishing good server build standards and checklists.

Of Cubes, Offices and Remote Access Via VPN

Edit: I still believe this is true.

A system administrator’s take on getting the most from the work day.

Originally posted November 2006 by IBM Systems Magazine

Last month I looked at reasons why a VPN is a great idea for accessing your network when you are not in the office. This article examines issues I’ve encountered when working in a cube farm, and different methods I like to use when trying to get continuing education while training budgets continue to get squeezed.

When your cell phone goes off in the middle of the night and you find that a system is down and requires your attention, does your employer require you to get dressed and drive to your workplace to take care of the situation? In some environments, that answer is yes. For whatever reason, a VPN may not be allowed into the network and you must drive on site to resolve the issue. In other cases, you may have a hardware failure and no tools are available to remotely power machines on and off. Maybe you are having issues bringing up a console session remotely, and you have to drive on site. Generally, however, in most situations we are able to log in and resolve the issue without leaving the comfort of our homes.

Many companies encourage their employees to resolve issues from home as the response time is much quicker, and they hope the employee can quickly resolve the issue, get some sleep, and still be able to make it into the office for their regular hours during the day. However, the flexibility that these companies show during off hours often is not extended during daylight hours; the belief apparently being that an employee who they can’t see in the office must not actually be working.

I have worked in environments where you needed to be on site to mount tapes and to go to the users’ workstations to help them resolve computing issues they might be experiencing. There are also times that you need to be on a raised floor to actually access hardware, or you might be asked to attend a meeting in person. For the most part, much of the day-to-day work of a system administrator can be handled remotely.

When tasks are assigned to team members via a work queue, and when you are able to communicate with coworkers via e-mail and instant messaging (and a quick phone call to clarify things once in a while) there is no reason, in my opinion, to come on site every day. Some shops, however, want everyone to work in cubicles, and have everyone available during the same hours. They feel this will lead to more teaming and quicker responses from co-workers. What I’ve found in these situations is the opposite.

The Cacophony of the Cube Farm


It gets very noisy in a cube farm, and there is a great deal of socializing that takes place throughout the day. Some people try to solve the issue by isolating themselves with noise canceling headphones and hope that they can get some “heads down” time to work on issues. Instead of being part of the environment, they’re isolated and can’t hear what’s happening around them. People can still interrupt them by tapping them on the shoulder, but I find that it’s more efficient to contact them electronically instead of in person.

Cube farms easily lend themselves to walk-up requests from other employees who sit in the same building. Most organizations do their best to have change control and problem reporting tools to manage their environments. When coworkers try to short circuit the process and walk up to ask for a quick password reset or a failed login count reset, or to quickly take a look at something, it can cause problems.

Some people follow the process and open a ticket in the system, or they call the helpdesk. The helpdesk opens a ticket and assigns it to a work queue. The people who walk up to the cube bypass that whole process for a quick favor. It may not take very long to help them out, but it does cause issues. The person who granted the favor was interrupted and lost their concentration, and possibly stopped work on a high severity or mission critical situation.

The person who walked up also stopped what they were doing, walked over, waited to get your attention, and then waited while you worked on their problem. This prevented you from working on the problem you had already committed to getting done. There was no record in the system that this issue came up, which in some environments can lead to underreporting of trouble tickets and cause management to believe fewer requests are being fulfilled than actually are. When you ask them to go back and fill out a form or call the helpdesk, they can get upset that you did not immediately help them out. If you ask them to open a ticket after the fact, it becomes a hassle for them, and they have no real motivation to go back and take care of the paperwork since their request has already been handled.

What I’ve found works better for me is to work remotely during the day. The interrupts still come in via instant messaging or e-mail, but I can control when I respond to them. During an event that requires immediate assistance, I can easily be paged or called on my cell phone. Just because an e-mail or an instant message comes in, that doesn’t mean I have to immediately stop what I’m working on in order to handle it. I can finish the task I’m working on, and when I reach a good stopping point, I can find out what the new request is. Depending on the severity of the request, and how long it will take, I can then prioritize when it will need my attention.

I also find that since my coworkers are not standing there waiting for me to respond, there is less time wasted by both parties. They send me an e-mail or instant message, and go on doing other things while waiting for me to respond. If it’s appropriate, I have them open a ticket and get it assigned to the correct team to work on it. For some reason, the request to have them open a ticket has been met with less hostility when I have done it over instant messaging versus a face-to-face discussion.

Offices Versus Cubicles


My next favorite place to work, if I must be onsite, is an actual office with a door that I can shut. Many companies have gravitated away from this arrangement due to the costs involved, but I think it bears some reconsideration. The noise levels in a shared office environment end up irritating a good portion of the employees. Office mates that use the phone can be heard up and down the row. Some employees want less light, some want more. Some want less noise, some want to listen to the radio and shout over the cubicle partitions to get their neighbor’s attention. All the background noise and the phone conversations make it very difficult to concentrate when working on problems.

There can be advantages to a shared work environment. When you overhear an issue that a coworker is working on, for example, you may be able to offer some help. Other times, it can be conducive to a quick off the cuff meeting with people. You can quickly look around and determine if someone is in the office that day. Some people thrive in a noisy environment, and it often all comes down to personal style and how people work best. I think many companies would be well served to offer options to their employees.

In discussing this topic with co-workers, there are some who would refuse to work from home, since they may not feel disciplined enough to get work done in that environment and they would miss the interpersonal interaction they currently enjoy. I've heard some say they would feel cooped up in an office and need the stimulation that comes from having their coworkers around. But, for some, the ability to work remotely or to work in an actual office makes for a happier and far more productive employee.

Setting up work environments has gotten so bad at times that I have seen companies set up folding tables with a power outlet and a network switch and ask people to work in that space. I suppose for a quick ad hoc project, or a disaster recovery event, this may make some sense, but to ask people to work this way day in and day out seems almost inhumane. At least with a cubicle you have some semblance of walls, but in this arrangement employees are sitting shoulder to shoulder, and I honestly have no idea how they can even begin to think about getting things done.

Flexible Hours


Along with the ability to work remotely, I also enjoy the ability to work flexible hours. If you are working on projects, does it really matter what time of the day you work on them? I have enjoyed the flexibility of working in the morning, taking my kids to school, working more after that, taking a break around lunch time and going to the gym or out for a bike ride, then working more after that. I’ve found that I actually worked longer hours, but I didn’t mind since I was setting my own schedule and determining what time of day was most appropriate to work on the tasks at hand. Some people work better later in the evening, so why not let them work then?

Why be expected to work from 9 a.m. to 5 p.m. when 6 a.m. to 9 p.m. may work better for workers, with some breaks during the day to attend to personal matters? Some managers insist they can’t effectively supervise their employees if they don’t constantly have them around to monitor. I say this is nonsense; you can very easily tell if your employees are doing their job based on the feedback you get from people who are asking them to do work. Are they closing problem tickets? Are they finishing up the projects assigned to them? Are they attending their meetings and conference calls? Are they responsive to e-mail? If so, who cares what time of day or location the employee happened to be working from?

Training Time


Another difficult thing to do in a noisy environment is simply read and concentrate. With training budgets getting cut, many employees find that, to keep their skills current, they must constantly read and try things on their own in test environments. IBM Redbooks and other online documentation may be all the exposure that people get with topics like virtualization or HACMP or VIO servers. Having a quiet place to study, while having access to a test machine, can do wonders as far as training goes, although it doesn’t offer the same depth you can get when you are able to go to a weeklong instructor-led class. But, it’s usually better than computer-based training (CBT), in my opinion.

Hands-on lab-based training should be built into the job. The opportunities should be made available to those who want to keep their skills current, even if the training budget isn’t there. Companies should make sure employees are given the time to study these materials, even if the funding isn’t available to allow them to go to formal classes.

More than one employer has told me they gave me an unlimited license to use all of the CBT courses I could take, at a huge cost savings to the company. When I looked at the course catalog, it was definitely a case of getting what they paid for. Many times, the classes contained older material, and it was usually at an inappropriate skill level. At least with Redbooks and a test machine, you can quickly find out whether you can get the machine to do what you think it should.

Employee Retention is Key


They say the cost of employee turnover can be significant. Instead of spending all the money to recruit and train someone new, I am always amazed that a company is not more interested in retaining the talent that they already have. The environment where people spend many of their waking hours will have an impact on whether companies are able to recruit new talent, and retain the talent they already have.

By taking steps to make the work environment less distracting, companies will likely realize a more productive workforce. If this means providing employees with their own office, then it will be money well spent. If this means letting them work remotely, that will also be a good solution. Be sure to encourage them to schedule the time in their day to read and study and try things out in a lab setting. As they gain more skill and have a quiet environment to work in, the company will find an energized and motivated pool of talent to call upon to implement their next project.

Advice for the Lazy Administrator

Edit: Still good stuff.

Originally posted September 2006 by IBM Systems Magazine

I always liked the saying that “a lazy computer operator is a good computer operator.” Many operators are always looking for ways to practically automate themselves out of a job. For them, the reasoning goes: “why should we be manually doing things, if the machine can do them instead?”

A few hours spent writing a script or tool can pay for itself very quickly by freeing up the operator’s time to perform other tasks. If you set up your script and crontab entry correctly, you can let your machine remember to take care of the mundane tasks that come up while you focus on more important things, with no more forgetting to run an important report or job. Sadly, even the best operator with the most amazing scripts and training will need help sometimes, at which point it’s time for the page out.
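As a trivial example of handing a mundane task off to cron, an entry like this (the script name and schedule are placeholders) runs a nightly report so nobody has to remember it:

# root's crontab (crontab -e): run the filesystem report at 2 a.m. and keep the output
0 2 * * * /usr/local/bin/fs_report.ksh > /var/adm/fs_report.log 2>&1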

In our jobs as system administrators, we know we’re going to get called out during off hours to work on issues. File systems fill up, other support teams forget their passwords or lock themselves out of their accounts at 2 a.m., hardware breaks, applications crash. As much as we would love to see a lights out data center where no humans ever touch machines that take care of themselves, the reality is that someone needs to be able to fix things when they go wrong.

We hate the late night calls, but we cope with them the best we can. Hopefully management appreciates the fact that many of us have families and lives outside of work. We are not machines, or part of the data center. We can’t be expected to function all day at work, then all night after getting called out. It’s difficult to get back to sleep after getting called out, and it impacts our performance on the job the day after we are called or, worse, it ruins our weekends. However, our expertise and knowledge are required to keep the business running smoothly with a minimum of outages, which is all factored into our salaries.

I have seen different on-call methods used, but the basic approach is the same. Each person on the team gets assigned a week at a time, with some jockeying around to schedule on-call weeks to avoid holidays, and usually people can work it all out at the team level. In one example, I even saw cash change hands to ensure that one individual was able to skip his week. Whatever method is used, the next question is how long you're on call. Is it 5 p.m. – 8 a.m. M-F and all day Saturday and Sunday? Is it 24 x 7 Monday through Monday? Does the pager or cell phone get handed off on a Wednesday? Do we use individual cell phones or a team cell phone? These are all answers to the same question, and you have to find the right balance between the number of calls you deal with off-shift and the on-call workload during the day.

On call rotation is the bane of our existence, but we can take steps to reduce the frequency of the late night wake up calls. If we have stable machines with good monitoring tools and scripts in place, that can go a long way towards eliminating unnecessary callouts. Having a well-trained first-level support, operations, or help desk staff can also help eliminate call outs.

In a perfect world, a monitoring tool like NetView or OpenView or Netcool is in place monitoring the servers, where all of the configurations are up to date and all of the critical processes and filesystems are being monitored. When something goes bad, operations sees the alert, and they have good documentation, procedures and training in place to do some troubleshooting. Hopefully they’ve been on the job for a while and know what things are normal in this environment, and they can quickly identify when there is a problem. For routine problems, you have given them the necessary authority (via sudo) or written scripts for them to use to reset a password, reset a user’s failed login count, or even add space to a filesystem if necessary.
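A minimal sketch of what that delegation could look like in /etc/sudoers follows; the group name, commands and filesystem are examples, not a recommendation for any particular environment:

# Let the operations group handle the routine requests without full root access
%operators ALL = (root) /usr/bin/chsec -f /etc/security/lastlog -a unsuccessful_login_count=0 -s *
%operators ALL = (root) /usr/bin/pwdadm *
%operators ALL = (root) /usr/sbin/chfs -a size=+1G /var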

I spent time in operations early in my career, and learned a great deal from that opportunity. I remember it was a great stepping stone: many of my coworkers got their start working 2nd and 3rd shift in operations positions. This was a great training ground, but all of the good operators were quickly “stolen” to come work in 2nd and 3rd level support areas.

If another support team needs to get involved, operations pages them and manages the call. Then the inevitable happens: someone needs to run something as root, or they need our help looking at topas or nmon, etc. Hopefully they were granted sudo access to start and stop their applications, but sometimes things just are not working right, and that’s when they page the system administrator. Ideally, by the time we’ve been paged, first level support has done a good job with initial problem determination, the correct support team has been engaged, and by the time they get to us, they know what they need for us to do and it will be a quick call and we can go back to sleep.

Sometimes it's not a quick call: nobody knows what's wrong, and they're looking to us to help them determine if anything is wrong with the base operating system. In a previous job, I used a tool that kept a baseline snapshot of what the system should look like normally. It knew which filesystems should be mounted, what the network looked like and which applications were running, and it saved that information to a file. When run on the system in its abnormal state, it was easy to see what was not running, which made finding a problem very simple. Sometimes, however, this did not find anything either, which is where having a good record of all the calls that have been worked by the on-call team is a godsend. A quick search for the hostname would bring back hits that could give a clue as to problems others on your team had encountered, and what they had done to solve them.
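I no longer have that tool, but the idea is simple enough that a few lines of ksh capture it. This is only a sketch; the snapshot path is arbitrary, and a real version would record far more:

#!/usr/bin/ksh
# Save a simple system baseline, or compare the current state against it
BASE=/var/adm/baseline.$(hostname)

snapshot() {
    df -k                     # mounted filesystems and free space
    netstat -in               # network interfaces and addresses
    lsps -a                   # paging space
    ps -eo comm | sort -u     # names of running commands
}

case "$1" in
    save)  snapshot > "$BASE" ;;
    check) snapshot > /tmp/baseline.now
           diff "$BASE" /tmp/baseline.now ;;
    *)     echo "usage: $0 save|check" ;;
esac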

At some point, the problem will be solved, everyone will say it’s running fine, and everyone will hang up from the phone call (or instant messaging chat, depending on the situation) and go to bed. Hopefully, as the call was ongoing, you were keeping good notes and updating your on call database with the information that will be helpful to others to solve the problem in the future. Just typing in “fixed it” in the on call record will not help the next guy who gets called on this issue nine months down the road.

Hopefully you are having team meetings, and in these meetings you are going over the problems your team has faced during the last week of being on call, and the solutions that you used to solve them. There should be some discussion as to whether this is the best way to solve it in the future, and whether any follow-up needs to happen. Do you need to write some tools, or expand some space in a file system, or to educate some users or operations staff? Perhaps you need to give people more sudo access so they can do their jobs without bothering the system admin team.

Over time, the process can become so ingrained that the calls decrease to a very manageable level. Everyone will be happier, and the users will have machines that don't go down; if and when they do, operations is the first team to know about it. The machines can be proactively managed, which will save the company from unnecessary downtime.

The Benefits of Working Remotely Via VPN

Edit: Hopefully this problem has been solved by now.

Originally posted October 2006 by IBM Systems Magazine

It’s 2 a.m., and you’ve just been paged. Do you have an easy way to get into your network, or is the pain of waking up going to be compounded by frustrations associated with dialing into work? In the good old days, I can remember dialing into work with a modem in order to get work done. It was slow, but there weren’t any alternatives. I just thought I was lucky I could avoid the drive back onsite to fix something in the middle of the night.

Sometimes I would use a package like Symantec’s pcAnywhere to remotely control a PC that had been left powered on in the office. We would use this same type of solution for our road warriors, who would dial in from a hotel room and do their best to get their e-mail or reports from the server. It wasn’t ideal, but it was one of the best solutions available at the time. Some employers still use solutions like pcAnywhere, gotomypc.com, Citrix, etc. These approaches can be useful for non-technical users, or for people that need to use desktops that are locked down. However, with the advent of the ability to tunnel over a virtual private network (VPN) into the corporate network, the need to use remote control software should lessen, especially for the technical support staff members who happen to be remote.

The need to be remote might not even be related to a call out in the middle of the night. You might have employees who travel and need to access the network from a cab, airport or hotel. You may be interested in offering the ability for your employees to work remotely and require them to be in the office less often. You may have an employee who is too sick to come into the office, but not so sick that they cannot take some Dayquil and do some work from home. You may have an employee with a sick child who is unable to go to daycare. Instead of asking them to take a sick day to care for their child, hopefully you have the tools and policies in place to allow them to work remotely while their child is resting. All of these situations end up being productivity gains for the employer. Instead of idle time during which an employee is unable to connect to the office and get work done, a simple VPN connection into the office gives the employee the opportunity to get things done from wherever they are, using the tools they’re accustomed to.

I have known customers that outfit their employees with laptops that allow them to work from home, but then cripple them with a Citrix solution, or another remote access method that doesn’t allow them to use the tools that are on their machines. It’s much easier for the employee to use the applications that are loaded on the laptop, in the same way that they are used in the office. When you put another virtual desktop in the middle of things, it complicates life unnecessarily compared to allowing this machine to be just another node on the network.

Security Considerations and Precautions

There are security considerations and precautions that need to be taken when thinking about a VPN. Nobody wants to deploy a solution that allows their employees in, but also allows non-employees to have unauthorized access. We must do our best to mitigate these risks, while still allowing trusted people to have the resources to do their jobs. There are going to be some networks that don’t allow any traffic in or out of them from the outside, and obviously this discussion is not intended for them. There are going to be situations where sensitive information exists where the risk of disclosure outweighs any benefits of allowing remote access to anyone.

In many instances, providing employees with network access is a benefit to the employee and the employer. The time it will take to wait for an employee to get dressed and drive in (especially when they live great distances away) can be an unacceptable delay when a critical application goes down during the night. Instead of waiting for them to drive on-site, provide the right tools to get the job done remotely.

An ideal world is one where you can work seamlessly from wherever you happen to be. Cellular broadband networks, 802.11 wireless networks, and wired broadband networks in the home, coupled with a decent VPN connection, have gotten us to the point where it really doesn't matter where an employee physically resides in order to get the work done. We can see the truth of that statement in the globalization of the technical support work force. Many organizations are taking advantage of the benefits of employees working from anywhere, including other countries. It would be ridiculous to ask an employee to work remotely from overseas over a Citrix connection that has a 15-minute inactivity timeout. It should be just as ridiculous to ask a local employee to use this type of connection to troubleshoot and resolve issues with servers.

Using What You’re Familiar With

When you need to connect to your hardware management console (HMC) from home, it’s nice to run WebSM the same way you do in your office. You could run Secure Shell (ssh) into the HMC as hmcroot, and run vtmenu. From there, you enter the correct number for the managed system you want to use, and then type the number of the LPAR you want to open a console window for. This is fine, but sometimes you need to use the GUI to do work on the profiles or to stop and start LPARs.
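For example, the command-line route looks roughly like this. The HMC host name is made up, and the managed system and partition names in the rmvterm line are placeholders:

ssh hmcroot@hmc1.example.com
vtmenu                            # pick the managed system, then the LPAR, to get a console
# type ~. to leave the virtual terminal, or close it from another session with:
# rmvterm -m <managed_system> -p <lpar_name>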

Why not just use the tools and methods you’re familiar with and use in the office? I’ve worked both ways, and being able to suspend your laptop, go somewhere else, restart it, connect to the VPN, and pick up right where you left off by using virtual network computing (VNC) is a great way to work. If you have your instant messenger running in a VNC session, it can be so seamless that your coworkers may not even realize that you have moved physical locations – they just noticed that you did not respond to them for a while, and you did not have to interrupt the flow of the chat session that was in progress.

Being asked to use a Citrix-like solution that is clunky by comparison (especially if there are issues with the Citrix connection being lost, or timing out too quickly) can quickly make employees not as eager to take care of problems from home. Instead of quickly and easily connecting to the network and solving the problem, you have people wasting time trying to use a difficult solution.

When I use a seamless VPN connection, I actually find that I work more hours. It’s so easy to get online, I constantly find myself doing work before and after my hours on-site, and even doing things on the weekends. Checking e-mail, looking at server health-check information and checking the on-call pager logs are all so easy to do, I figure why not spend a few minutes and do them. When I contrast that with a solution that’s painful to use, I see that people are not nearly as interested in getting online and getting things done, and things are only done as a last resort in a situation where they have to get online to fix something that’s broken.

VPN Options

I have used commercial VPN offerings, including the AT&T network client, the IBM WebSphere Everyplace Connection Manager (WECM), and open source offerings including OpenVPN. There are pros and cons with all of them, but the main thing that they shared was the capability to make your remote connection replicate the look and feel of your office environment while you’re away from the office.

One aspect of the AT&T client that I liked was the ability to switch between dial-up access when you could not find broadband and a broadband connection when you could. Obviously, the speed differential was tremendous, but the capability to dial in when there is no other way to make a connection was very helpful while traveling.

When I used a WECM gateway, I found I was able to be connected on a wireless network, suspend my laptop, go to a wired network, take my laptop out of hibernation, and have the network connections re-establish themselves over the new connection. This made things even more seamless and transparent to the end user.

As this IBM Web site explains: “IBM WebSphere Everyplace Connection Manager (WECM) Version 5.1 allows enterprises to efficiently extend existing applications to mobile workers over many different wireless and wireline networks. It allows users with different application needs to select the wireless network that best suits their situation. It also supports seamless roaming between different networks. WECM V5.1 can be used by service providers to produce highly encrypted, optimized solutions for their enterprise customers.”

“WECM V5.1 is a distributed, scalable, multipurpose communications platform designed to optimize bandwidth, help reduce costs, and help ensure security. It creates a mobile VPN that encrypts data over vulnerable wireless LAN and wireless WAN connections. It integrates an exhaustive list of standard IP and non-IP wireless bearer networks, server hardware, device operating systems, and mobile security protocols. Support for Windows Mobile V5 devices clients has now been added.”

Both of these solutions cost money, so a low cost method is to set up a Linux machine as an OpenVPN server. A full discussion is beyond the scope of this article, but more information can be found at openvpn.net. From that site’s main page: “OpenVPN is a full-featured SSL VPN solution that can accommodate a wide range of configurations, including remote access, site-to-site VPNs, WiFi security, and enterprise-scale remote access solutions with load balancing, failover, and fine-grained access-controls.”

“OpenVPN implements OSI layer 2 or 3 secure network extension using the industry standard SSL/TLS protocol, supports flexible client authentication methods based on certificates, smart cards, and/or 2-factor authentication, and allows user or group-specific access control policies using firewall rules applied to the VPN virtual interface. OpenVPN is not a Web application proxy and does not operate through a Web browser.”
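To give a feel for how little configuration a basic routed setup needs, here is a hedged sketch of an OpenVPN server.conf; the certificate and key file names and the subnets are example values only:

# /etc/openvpn/server.conf - minimal routed VPN (example values)
port 1194
proto udp
dev tun
ca   ca.crt
cert server.crt
key  server.key
dh   dh1024.pem
server 10.8.0.0 255.255.255.0            # address pool handed out to clients
push "route 192.168.1.0 255.255.255.0"   # example office subnet to reach over the tunnel
keepalive 10 120
persist-key
persist-tun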

The competition for talent in today’s IT world is fierce. During the interview process, when a potential candidate asks you about the solution that you use for working from home and on call support connectivity, hopefully you can give them the right answer. With the right infrastructure in place, it may even be possible to recruit talent and allow them to continue living where they are, instead of asking them to relocate.

Most organizations already have good solutions in place, but it never hurts to revisit the topic, and see if there is room for improvement where you work.

Real World Disaster Recovery

Edit: One of my favorite articles.

Originally posted June 2006 by IBM Systems Magazine

Disaster recovery (D/R) planning and testing has been a large part of my career. I’ve never forgotten my first computer-operations position and the manager who showed me a cartoon of two guys living on the street. One turned and said to the other, “I did a good job, but I forgot to take good backups.”

I’ve been involved in D/R exercises for a variety of customers, and I was also peripherally involved in a D/R event that happened after Hurricane Katrina.

Does your datacenter have the right procedures and equipment in place to recover your business from a disaster? Can your business survive extended downtime without your computing resources? Is your company prepared for a planned D/R event? What about an unplanned event? I’ve helped customers recover from both types of events. This article provides a place to start when considering D/R preparations for your organization.

Comfortable Circumstances

There’s a big difference between planned and unplanned D/R events. After traveling to an IBM* Business Continuity and Recovery Services (BCRS) center, I helped restore 20 AIX* machines during the 72 allocated hours. I was well-rested and well-fed. We knew the objectives ahead of time, and we took turns working and resting. Additionally, we didn’t restore all of the servers in the environment, but hand-picked a cross-section of them. We modified, reviewed and tested our recovery documentation before we made the trip, and we made sure there was enough boot media to do all the restores simultaneously – and even cut an extra set of backup tapes just in case.

We had a few minor glitches along the way, but we were satisfied that we could recover our environment. However, these results must be taken with a grain of salt, as this whole event was executed under ideal circumstances.

In another exercise, I didn’t have to travel anywhere; I went to the BCRS suite at my normal IBM site and spent the day doing a mock D/R exercise. We were done within 12 hours. We had a few minor problems, but the team agreed that we could recover the environment in the event of an actual disaster. Again, I was well-rested and well-fed.

Katrina Circumstances

As Hurricane Katrina was about to make landfall, e-mails went out asking for volunteers to help with customer-recovery efforts. I submitted my name, but there were plenty of volunteers, so I wasn’t needed. A few weeks later, the AIX admin that had been working on the recovery got sick, and I was asked to travel onsite to help.

Although I can’t compare the little bit that I did with the Herculean efforts that were made before I arrived, I was able to observe some things that might be useful during your planning.

A real D/R was much different from the tests that I’d been involved with in the past. The people worked around the clock in cramped quarters, getting very little sleep. There were too many people on the raised floor, and there weren’t enough LAN drops for the technicians to be on the network simultaneously.

The equipment this customer was using needed to be refreshed, so there was an equipment refresh along with a data recovery, which posed additional problems during the environment recovery. Fortunately, the customer had a hot backup site where the company could continue operations while this new environment was being built. However, as is often the case, the hot backup site had older, less powerful hardware. It was operational – but barely – and we wanted to get another primary site running quickly.

One of the obvious methods of disaster preparation is to have a backup site that you can use if your primary location goes down. Years ago, I worked for a company that had three sites taking inbound phone calls. They had identical copies of the database running simultaneously on three different machines. They could switch over to the other sites as needed. During the time I was there, we had issues (snow, rain, power, hardware, etc.) that necessitated a switch over to a remote location. We needed to bring down two sites and temporarily run the whole operation on a single computer. This was quite a luxury, but the needs of the business demanded that was the route to be taken. This might be something to consider as you assess your needs.

Leadership must be established before beginning – either during a test or a real disaster. Who’s in charge: the IBM D/R coordinator, the customer or the technicians? And which technicians are driving the project: the administrators from the customer site, consultants or other technicians? All of these issues should be clearly defined so people can work on the task at hand and avoid any potential political issues.

The Importance of Backups

During my time with the Katrina customer recovery, I found out that one of the customer’s administrators had to be let go. It turns out that he’d been doing a great job with his backup jobs. He ran incremental backups every night, and they ran quickly. However, nobody knew how many years ago he’d taken his last full backup. The backup tapes were useless. Fortunately, their datacenter wasn’t flooded and, after the water receded, they were able to recover some of their hardware and data.

Are your backups running? Are you backing up the right data? Have you tested a restore? One of the lessons we learned during a recovery exercise was that our mksysb restore took much longer than our backup. Another lesson we learned was that sysback tapes may or may not boot on different hardware. Does your D/R site/backup site have identical hardware? Does your D/R contract guarantee what hardware will be available to you? Do you even have a D/R contract?
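On the AIX side, a couple of lines go a long way toward avoiding that administrator's fate; the path and schedule below are only examples:

# root's crontab: full system backup to an NFS-mounted path every Sunday at 1 a.m.
0 1 * * 0 /usr/bin/mksysb -i /backup/server1.mksysb > /tmp/mksysb.log 2>&1

# Periodically prove the image is usable by listing what is actually in it
lsmksysb -lf /backup/server1.mksysb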

Personnel Issues

We had personnel working on this project who were from the original customer location and knew how to rebuild the machines. However, they were somewhat distracted as they worried about housing and feeding their families and finding out what had happened to their property back home. Some were driving hundreds of miles to go home on the weekend – cleaning up what they could – and then making the long drive back to the recovery site. Can you give your employees the needed time away from the recovery so they can attend to their personal needs? What if your employees simply aren’t available to answer questions? Will you be able to recover?

Other Issues

Other issues that came up involved lodging, food and transportation. FEMA was booking hotel rooms for firefighters and other rescue workers, so finding places to stay was a challenge. For a time, people were working around the clock in rotating shifts. Coordinating hotel rooms and meals was a full-time job. Instead of wasting time looking for food, the support staff brought meals in and everyone came to the conference room to eat.

You may remember that Hurricane Rita was the next to arrive, so there were fresh worries about what this storm might do, and gasoline shortages started to occur. After you’ve survived the initial disaster, will you be able to continue with operations? I remember reading a blog around this time about some guys in a datacenter in New Orleans and all the things they did to keep their generators and machines operational. Do you have employees who are willing to make personal sacrifices to keep your business going? Will you have the supplies available to keep the people supporting the computers fed and rested?

Test, Test, Test

I highly recommend testing your D/R documentation. If it doesn’t exist, I’d start working on it. Are you prepared to continue functioning when the next disaster strikes? Will a backhoe knock out communications to your site and leave you without the ability to continue serving your customers? Do you have a BCRS contract in place? I know I don’t want to end up like the guy in the cartoon complaining that he did not have good backups and D/R procedures in place. Do you?

Network Troubleshooting

Edit: It has been a while since I needed to mess with SSA disks.

Originally posted September 2005 by IBM Systems Magazine

Recently, a user opened a problem ticket reporting that copying files back and forth from a server we support was taking an unusually long time. The files weren’t all that large, but the throughput was just terrible. After poking around a bit, we found that the Ethernet card wasn’t set to the correct speed. When we ran lsattr -El ent0, we found the media_speed set to Auto_Negotiation. I knew what the problem was immediately.

We’ve seen the Auto_Negotiation setting on Ethernet adapters be problematic on AIX. Our fast Ethernet port on the switch was always set to 100/Full. With Auto_Negotiation on, sometimes the card would correctly set itself to 100/Full, but at other times it would go to 100/Half. That duplex mismatch slows the network down because it causes collisions, which you can see with netstat -v.

Packets with Transmit collisions:

 1 collisions: 204076      6 collisions: 37         11 collisions: 1

 2 collisions: 65375       7 collisions: 6          12 collisions: 0

 3 collisions: 16894       8 collisions: 2          13 collisions: 0

 4 collisions: 2404        9 collisions: 0          14 collisions: 0

 5 collisions: 255        10 collisions: 2          15 collisions: 0

You can also determine if you’re having Receive Errors and see what speed your adapter is running at by using netstat -v.  You’ll see something similar to the following:

RJ45 Port Link Status : up

Media Speed Selected: Auto negotiation

Media Speed Running: 100 Mbps Full Duplex

Transmit Statistics:                      Receive Statistics:

--------------------                      -------------------

Packets: 33608151                      Packets: 82280769

Bytes: 3364953629                      Bytes: 89992126877

Interrupts: 15105                          Interrupts: 79762362

Transmit Errors: 0                         Receive Errors: 14000

Packets Dropped: 1                      Packets Dropped: 14

                                                     Bad Packets: 0

How did we fix the duplex issue? We detached the interface and ran a chdev to make it 100/Full: chdev -l 'ent0' -a media_speed='100_Full_Duplex'. Once we made this change, there were no more collisions and the user was a happy camper.
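The full sequence was roughly the following; the interface names are from this example, and the change drops the interface, so plan for a brief outage:

ifconfig en0 detach                             # take the interface down so the adapter can be changed
chdev -l ent0 -a media_speed=100_Full_Duplex    # hard-set 100 Mbps full duplex
chdev -l en0 -a state=up                        # bring the interface back up
lsattr -El ent0 -a media_speed                  # confirm the setting
netstat -v ent0 | grep -i collision             # watch that the collision counters stay flat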

Verifying Failed SSA Disks

Another issue that seems to crop up is when SSA disks die. How do you know which physical disk in your drawer needs to be replaced? In some instances, when the disk dies, you’re no longer able to go into Diag / Task Selection / SSA Service Aids / Link Verification to select your disk and identify it because it’s no longer responding. 

In this situation, you can use link verification to identify the SSA disks on either side of the failed disk. You can then look for the disk that's between the two blinking disks, and you know which disk is bad. Another way to verify that you've selected the correct disk to replace is to run lsattr -El pdiskX, where "X" is replaced with your failing pdisk number. This provides the serial number that you can match with the serial number printed on the disk. (Note: The serial number may not be an exact match, but you can match fields 5-12 of the connwhere_shad value in the output, omitting the trailing 00D, with the printed serial number on the disk.) Here's the output:

lsattr -El pdisk45

adapter_a       ssa3             Adapter connection                                   False

adapter_b       none             Adapter connection                                   False

connwhere_shad  006094FE94A100D  SSA Connection Location                              False

enclosure       00000004AC14CB52 Identifier of enclosure containing the Physical Disk False

location                         Location Label                                       True

primary_adapter adapter_a        Primary adapter                                      True

size_in_mb      36400            Size in Megabytes

Another way to find your disk based on its location codes is by using lsdev -C | grep pdiskX. Once you've identified the failed disk, run rmdev -dl pdiskX, swap in the replacement disk and run cfgmgr.
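Putting those steps together, with pdisk45 from the example above standing in for your failed disk:

lsattr -El pdisk45 -a connwhere_shad   # match the serial before pulling anything
rmdev -dl pdisk45                      # remove the device definition
# physically swap the disk, then rediscover it
cfgmgr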

If your SSA disk was part of a raid array, hopefully at this point your hot spare took over, and you can just make your replacement disk the new hot spare disk. To make your disk a hot spare, use diag / task selection / ssa service aids / smit — ssa raid arrays / change show use of an ssa physical disk, and change your newly replaced disk from a system disk to a hot spare disk. To verify all is well, I like to go into smitty / devices / SSA RAID Arrays / List Status of Hot Spare Protection for an SSA RAID Array. It should report that the raid array is protected and the status is good. Keep in mind that only the latest SSA adapter (4-P) will allow list status of hot spare protection to work; older cards such as the 4-N don’t have this feature.

Exploring Linux Backup Utilities

Edit: I still really like Storix. Relax and Recover is pretty popular as well.

Originally posted April 2005 by IBM Systems Magazine

I’ve been an AIX administrator for a while now, and the mksysb and sysback utilities, which allow me to do bare-metal restores and return my machines to the state they were in when I last performed a backup, have spoiled me. As I’ve worked more with Linux machines, I’ve been bothered that they lack the equivalent utilities.

This is not to say that backup options don't exist at all for Linux machines. Some use the dd command to copy the entire disk. Many have written scripts around the UNIX tape archive command (tar), or they use the open-source utility rsync, which duplicates data across directories, file systems or networked computers.
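For example, both approaches can be as simple as the following; the tape device, paths and host name are placeholders:

# Archive a few directories to tape with tar
tar -cvf /dev/st0 /etc /home /var/www

# Mirror the root filesystem to a second disk, or a home directory to another host, with rsync
rsync -avx --delete / /mnt/backupdisk/
rsync -avz /home/ backuphost:/backups/home/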

Others use the Advanced Maryland Automatic Network Disk Archiver (AMANDA) or dump. According to the University of Maryland’s AMANDA Web site, AMANDA lets a LAN administrator, “set up a single master backup server to back up multiple hosts to a single, large capacity tape drive.” AMANDA can use native dump facilities to do this.

Dump itself creates an archive directory and an access interface across whole file systems. It lets you build in specifications for when to run the dump, but it isn’t for everyone because it might not support all the file systems you need.

Two more advanced archive tactics are the cpio and afio utilities, both of which are considered to have better consistency and integrity than tar, with which they’re backward compatible. However, they may require a lot of time reading the man page and other resources to use them effectively.

In my Google search, I also came across the "Linux Complete Backup and Recovery HOWTO," maintained by software engineer Charles Curley, a 25-year veteran of the computing industry. This site provides instructions for backup and restore methods for several Linux products.

The solutions I've talked about can be more attractive to Linux users because they're free. Each of them also makes you build a minimal system before you can restore the rest of your system. When I argue AIX versus Linux, bare-metal restore is usually something I can bring up that Linux advocates can't address. So I started wondering if there was a tool exactly like mksysb or sysback, where you could boot up and restore your machine in one step. The only tool I found–Storix–offers a free personal edition and a free demo edition that you can use to test it with, but if you want the full benefits of this software, you'll need to buy a license. It isn't free software, and that may deter some in the Linux community.

Storix offers functionality similar to tools such as tar and secure shell (SSH) for backing up a machine over the network to a remote machine or tape drive, or locally to a file, tape drive or USB disk drive. The restore is where Storix differs, working as a true bare-metal restore. While other tools require you to reload the OS before running the restore process, Storix reloads the entire machine from the bootable CD, eliminating time spent configuring user IDs, groups, permissions, file systems, applications, etc. There are far too many places to inadvertently leave out something that had already been fixed when you rely on people to rebuild machines by hand after a hard drive dies. Bare-metal restores may help users feel more comfortable about making changes to the existing system because they can return the machine directly to its previous state.

There are as many methods to back up your machine as there are reasons to choose them. So go ahead and destroy your machine. Just make sure you have a good backup plan before you do so, no matter which tool you choose.

Preparing for Your Certification

Edit: Some links no longer work.

Originally posted December 2004 by IBM Systems Magazine

A co-worker recently finished the requirements for his IBM pSeries certification. I asked what he’d done to prepare for and pass the IBM eServer Certified Advanced Technical Expert: pSeries and AIX 5L (CATE). What follows are some ideas that came up while talking with him and another CATE-certified co-worker. In many cases, the primary attribute one needs to achieve this certification isn’t intelligence or skill with AIX, but the motivation to study and schedule the test.

Some people believe that taking tests and getting certifications are of no benefit and flatly refuse to do so. Others want to take every test available so they can prove to the world that they’re fully qualified for the tasks at hand. I’ve known plenty of people without certifications who were top-notch performers and really knew the material. I’ve also known people with certifications who, despite their book knowledge and test-taking abilities, lacked practical application skills. I believe a certification demonstrates that you’re familiar with the material and know enough about it to go pass a test. In some instances, employers and potential employers will examine your certifications during the hiring process. As is pointed out on IBM’s Certification Web site, certification is a way to “lay the groundwork for your personal journey to become a world-class resource to your customers, colleagues, and company.”

When preparing to take these tests, my friends told me that they would first visit the IBM Web site or directly access the tests, educational resources and sample tests. On IBM’s Certification Web site, the Test Information heading links to education resources, including Redbooks, which can be ordered or downloaded as PDF files. CATE certification requirements are outlined here.

To get started, determine if you meet the prerequisites, some of which are required and some of which are recommended. You can then choose three core requirement tests to take. You must take at least one of these tests–233, 234, 235, 236; you may substitute Test 187 for Test 237 and Test 195 for Test 197. Each test lists objectives, samples, recommended educational resources and assessment tools. After choosing your tests and acquiring your preparation materials, you’re ready to study. Some people take their time, reading Redbook chapters here and there as time permits. Others set study goals–read a chapter a day, read for 30 minutes a day or some other method. Still others find that going to a class works best for them. Use whatever method suits your learning style. At the end of each Redbook chapter is a short quiz to measure your understanding of the material presented in the chapter. These are excellent tools for verifying that you’re ready to take the test.

One co-worker would call the testing location and schedule a day and time for the test. He found that having that deadline looming kept him from procrastinating and forced him to work at acquiring the knowledge. My other co-worker preferred to methodically study the material, and wouldn’t schedule his test until he was sure that he had a good knowledge of the subject.

However you go about it, there's great satisfaction that comes from passing these certification tests. And while it doesn't prove anything more than your familiarity with the subject and your ability to pass a test, it may make the difference between you and another candidate in your hunt for a promotion or a new position.

Software Provides ‘Remote’ Possibilities

Edit: I still use these tools. Although I cannot remember the last time I ran telnet.

Originally posted June 2004 by IBM Systems Magazine

Have you ever wanted to remotely control your Windows* machine from a machine running AIX* or Linux while you were working on the raised floor? Have you ever started a long-running job from your office and wanted to disconnect and reconnect from home or another location? Have you ever rebooted your machine after a visit from the "blue screen of death"? Did the reboot interrupt your Telnet or Secure Shell (SSH) session, requiring you to log back in and start over again? Have you ever wanted to share your desktop with another user for training or debugging purposes?

If you answered yes to any of these questions, then Virtual Network Computing (VNC) and screen are two useful tools worth investigating.

VNC Benefits

Developed by AT&T, VNC is currently supported by the original authors along with other groups that have modified the original code. RealVNC, TridiaVNC and TightVNC are all different versions that interoperate seamlessly.

VNC has cross-platform capabilities. For example, a desktop running on a Linux machine can be displayed on a Windows PC or a Solaris* machine. The Java* viewer allows desktops to be viewed with any Java-capable browser. With the Windows server, users can view the desktop of a remote Windows machine on any of these platforms using the same viewer.

This free tool is quick to download–the Java viewer is less than 100K. The AIX* toolbox for Linux applications also has a copy of VNC.

VNC is comparable to pcAnywhere or other widely used remote-control software. VNC’s power is in the number of OSs that it can allow to interoperate. AIX controlling Linux, Linux controlling Windows, Windows controlling them both–these are just some of the possibilities.

After loading VNC with smitty, you can start it by running vncserver on the command line. I recommend creating a separate user that hasn’t previously logged into an X session; I’ve seen strange behavior when using the same user ID to start a normal X and VNC session.

vncserver prompts you twice for the same password. This information is stored in ~/.vnc/passwd and can be changed with the vncpasswd command. (Note: This directory also contains the xstartup configuration file, along with some log files that show the times and IP addresses of the clients that have connected to vncserver.) Each time you run vncserver, you'll have another virtual X desktop. (The first session runs on :1, the second on :2, etc.)

Verify that VNC is running with the ps -ef | grep vnc command (you should see Xvnc running). Connect to your server from your client machine, then run vncviewer. When prompted for the server name, enter either the IP address or the host name, followed by the session number you’re connecting to. VNC Web sites usually use snoopy:1 as a sample host name. You should then be prompted for the password you set up earlier.

At this point you should see an X desktop. The settings and desktop environment (CDE, KDE, etc.) can all be specified in ~/.vnc/xstartup. To allow others to view this session, select the shared-session option on the command line, or the shared-session setting in the GUI if your viewer is running on Windows. Once two or more users have connected to the vncserver, they all see and control the same session, unless a user selects the “view only” option when connecting.
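
The exact contents of ~/.vnc/xstartup depend on which desktop you want to run; as a minimal sketch, a stripped-down file that just starts an xterm and a basic window manager (rather than CDE or KDE) might look something like this:

   #!/bin/sh
   # ~/.vnc/xstartup -- minimal example; swap twm for your preferred desktop startup
   xrdb $HOME/.Xresources
   xsetroot -solid grey
   xterm -geometry 80x24+10+10 -ls &
   twm &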

Much of this information applies when running vncserver on Windows. Once you’ve downloaded the installer, installed the service and set up a password, you should be able to connect to your Windows machine by running vncviewer on your AIX machine. When you connect to your Windows machine, you don’t need to specify a display number after the host name, as there’s only one screen you can connect to on a Windows machine.
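
So, if the Windows box were named pcbox (a made-up host name for illustration), the command from the AIX side would simply be:

   vncviewer pcbox      # no display number needed when connecting to a Windows VNC server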

This is a powerful tool when team members are working remotely and are having trouble explaining what they’re seeing. Once you fire up a VNC session, the problem is usually apparent. It’s also a great tool for installing applications that require X when nobody feels like walking out to the raised floor, or when the machine is running headless and nobody wants to hook up a monitor to it.

The real power comes when you close vncviewer and then run it again from another location. You’ll be connected right where you left off–assuming the machine hasn’t been rebooted and no one’s stopped the vncserver process. When it’s time to stop the process, run vncserver -kill :X (where X is the number of the session you want to stop).

The Power of Screen

Another useful tool included with the toolbox CD is screen, which multiplexes a single physical terminal between several processes, typically interactive shells. After loading screen with smitty, enter “screen” on the command line. You’ll see copyright information; hit “space” or “enter” to proceed. You’ll then see a typical command prompt.

To get started, vi a file or read a manual page. When you need another command line, simply enter “ctl-a c” to create another window. You’ll be greeted with another prompt. You can continue to create several virtual windows and cycle through them by entering “ctl-a space.” Entering “ctl-a 0” returns you to the first window, “ctl-a 1” to the second, etc. To list all of your windows, enter “ctl-a w.” To display all of the key bindings that are possible in screen, enter “ctl-a ?.” If you need to detach from your session, enter “ctl-a d.” If you’re on another machine and want to attach to a screen running elsewhere, run screen -d to detach it from the other terminal and screen -r to reattach it to the one you’re on (or screen -d -r to do both in one step).
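
Putting those pieces together, a typical detach-and-reattach sequence looks something like this (comments are mine):

   screen               # start screen on the server; work in window 0, add windows with ctl-a c
                        # ctl-a d detaches the session and leaves everything running
   screen -ls           # later, from any login, list your screen sessions
   screen -d -r         # detach the session from wherever it is attached and reattach it here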

Power Combo

When I combine VNC and screen, I can use vncviewer to connect to my vncserver, which is running an xterm that’s running screen. My xterm takes up a small amount of real estate on the desktop, and I can quickly and efficiently move between my virtual command lines. This allows me to remain logged into multiple machines and quickly and easily switch back and forth. I can also easily cut and paste between OSs–cutting from AIX and pasting into an e-mail client running on Windows, or taking information from Windows and pasting it into my VNC session.
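
As a rough sketch of that workflow (host name and display number are examples only):

   vncviewer snoopy:1   # reattach to the persistent VNC desktop from wherever you are
   # then, in an xterm on that desktop:
   screen -d -r         # pick up the screen session with all of its windows intact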

VNC and screen are a powerful combination. With these tools, you can drop what you’re doing (or get dropped, in the case of a network or OS outage), go to another location and pick up where you left off. It’s a handy way to work.

VMware Provides Virtual Infrastructure Solutions

Edit: Much has changed since this was written, but it is still a great tool for running multiple operating systems on the same machine.

Originally posted May 2004 by IBM Systems Magazine

So you’ve heard Linux is the wave of the future and you want to try it out, but you don’t have a spare machine to load it on? You find yourself on the road with one laptop but would like to be able to run more than one operating system without dual booting? You’re already running Linux as your desktop OS and have the occasional need to run Windows applications? You do development work and need test machines to crash and burn? VMware may be the right solution for you.

At vmware.com, you can download a trial version of the software and evaluate it for 30 days, or purchase it from the Web site. There’s good documentation on the site to give you a thorough overview of what to expect when you run the software, and it goes into detail about the VMware server offerings. I’ll be focusing on the workstation version of VMware.

Once you’ve loaded the software, configured the amount of memory and disk you plan to allocate to your virtual machine, and decided which type of networking support you want, you’re ready to power it on. You treat each virtual machine like a regular PC, and you can create as many of them as you like. This way you can run RedHat, SUSE, Debian, Gentoo, Mandrake and Windows 95, 98 or XP all on the same computer at the same time. (Although I wouldn’t recommend trying to run them all at once, unless you have a huge amount of RAM.) Put the boot CD for the OS into the CD-ROM drive, press the virtual power button and your virtual machine will load your OS for you.

Once you’ve loaded and patched your OS, you can load the VMware tools package. From the Web site, “With the VMware Tools SVGA driver installed, Workstation supports significantly faster graphics performance. The VMware Tools package provides support required for shared folders and for drag and drop operations. Other tools in the package support synchronization of time in the guest operating system with time on the host, automatic grabbing and releasing of the mouse cursor, copying and pasting between guest and host, and improved mouse performance in some guest operating systems.”

I find that VMware really shines when I put it into full screen mode, and forget that I’m even running a guest operating system. I use the machine as if it were running Linux natively, until I find I need to do something in Windows, at which time I “Ctrl-Alt” back into my Windows session. Then, when I’m finished, I either power down the virtual machine (shutdown -h now) or just hit the suspend button to hibernate my virtual session so that I can pick it up where I left off the next time I want to use that OS.

Another nice feature is the ability to take the disk files (Linux.vmdk) that represent my current hard drive configuration and copy them to CD or send them over the network to a coworker. That coworker can then boot my exact machine configuration to help me look at bugs, see how my desktop is set up or see exactly how my OS is configured.
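
As a simple illustration (the .vmx file name and destination path here are my own assumptions, not from the original article), sending a powered-off virtual machine to a coworker can be as easy as copying its files:

   scp Linux.vmdk Linux.vmx coworker-box:/vmware/machines/   # copy the disk and config files while the VM is powered off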

If your organization finds that it doesn’t have the funds to allocate to a room full of test machines, or you need to take multiple machines with you for a presentation or demo, VMware is a solution to consider. Why settle for running Linux on that old machine in the corner when you can run it at the same time you run your primary workstation?