To VIOS or Not to VIOS

Edit: I assume the VIO server is right for you. Some links no longer work.

Consider whether the Virtual IO Server is right for you

Originally posted September 2010 by IBM Systems Magazine

I attended an OMNI user group meeting a while ago and during the meeting, someone mentioned the difference between attending an education event where you’re familiar with the topic versus a topic you know little about. While you’ll probably learn something at the familiar event, it may only be 1-2 percent added knowledge and a lot of repeated information. But at an event that’s unfamiliar, 50-60 percent of the material may be new to you and it might feel like you’re drinking from a fire hose as you try to digest all of these new ideas and concepts.

At one event, you’re comfortable. At the other, you can feel overwhelmed or wonder why you don’t already know these concepts. Rather than beat yourself up about the knowledge that you haven’t been exposed to yet, see it as an opportunity to learn something new.

Nowhere was that concept clearer at that meeting than during a discussion about whether or not to use the Virtual IO Server (VIOS).

To VIOS or Not to VIOS

Several people had a lively discussion around the pros and cons of virtual IO, but it was clear to me that many were unfamiliar with or misunderstood the capabilities of VIOS. They kept trying to compare VIOS to the managed partition they remembered from years past, which seemed to be all bad memories. They worried about VIOS being a single point of failure or adding a layer of complexity to their server. I’m not certain the IBM i world is fully on board with this solution yet.

At the meeting, it took a while to dispel the myths. Those of us with VIOS experience explained that you can have dual VIO servers so that VIOS is no more of a single point of failure than internal disks would be. With PowerVM virtualization and VIOS, you can continue to add more LPARs to your frame as long as you have available CPU and memory. You don’t have to spend more money for adapters or disks, which leads to lower overall costs compared with dedicated adapters and disks. Using VIOS, you could very easily set up test systems on the same frame as your production systems. Rapid provisioning becomes a reality when your environment is virtualized, as you’re not making any changes to physical hardware.

Using VIOS, you could share your storage environment and ‘play nice’ with the rest of the servers in the organization. Instead of people saying that you have an oddball/proprietary/expensive/closed machine sitting in the corner, you can tell them that besides running IBM i, you can also run AIX or Linux—all on the same frame, all sharing the same back-end storage-area network (SAN) and the same network and disk adapters.

Once the meeting attendees understood what you could do with VIOS, and they realized you can pretty much set it up and forget it (until you need to deploy new partitions, and even then it’s a straightforward process), it seemed to me that some warmed up to the idea of virtualizing using VIOS.

More recently, a midrange.com thread entitled “To use VIOS or Not to use VIOS, that is the question” discussed the same types of concerns about complexity and which systems should be primary or guest partitions.

I’ve written twice before on IBM i and VIOS—in a “My Love Affair with IBM i and AIX” blog entry and an article called “Running IBM i and AIX in the Same Physical Frame”—and I think the whole issue boils down to time, availability and training. It takes time to get comfortable with something new. None of us started working on IBM i and were experts in it within a week. It took time to become proficient. The same can be said for VIOS. If you come from a UNIX background, it can help, but the padmin user interface is foreign even to AIX administrators the first time they log into it. Things are just different enough that AIX admins have to learn the padmin/VIOS interface the same way that IBM i admins do. One great resource to start with is the VIO Cheat Sheet.
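
If it helps to picture that first padmin session, here’s a minimal sketch of a few commands a new VIOS administrator might try. These are standard VIOS CLI commands, but the output and device names will vary by system.

ioslevel              # show the VIOS software level
lsdev -type disk      # list the disks the VIO server can see
lsmap -all            # show how backing devices map to virtual SCSI server adapters
lstcpip               # review the VIOS TCP/IP configuration
oem_setup_env         # drop to a root shell when you need the underlying AIX environment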

Real-World Experience

How do you learn VIOS if you don’t have VIOS to play with? Without a test box to work on, it can be difficult to learn and understand. You can read IBM Redbooks publications such as “Virtual I/O Server Deployment Examples” and attend lectures on the topic (I recommend the Central Region Virtual User Group, which has replayed lectures on many topics, including VIOS configuration overviews), but without hands-on experience, it can be difficult to become proficient. I’d argue that this is the same as hiring a new IBM i admin, then asking him to read manuals and Redbooks publications without ever letting him log into the machine. He’ll probably not be very effective. With time, access to a server running VIOS and training, anyone can become comfortable with it.

Recently, a customer had a new POWER7 770 server to which they were adding 25 AIX and two VIOS partitions. No problem. I loaded VIOS on the internal disks, and the AIX partitions all booted from SAN. They wanted to get their feet wet with IBM i on the 770, and they wanted to see how it would perform using SAN disks instead of internal disks. No problem. I assigned the proper CPU and memory like I would for any new partition, but I didn’t assign any real IO devices. I assigned it virtual SCSI adapters and a virtual network adapter. It was getting its disk from a SAN. It was going to boot from SAN. I didn’t even use physical media to install it; I just used a virtual optical device in the VIO server and booted the LPAR from there. I used the open source tn5250 program to connect to the console, and we were able to load IBM i on the machine to test it out. They were very pleased with the performance that they saw with the SAN and the POWER7 server.
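
For anyone curious about the virtual optical piece of that install, here’s a minimal sketch of the VIOS commands involved. The repository size, the ISO file name and the vhost/vtopt device numbers are placeholders for illustration.

mkrep -sp rootvg -size 20G                                            # create a virtual media repository in the rootvg storage pool
mkvopt -name ibmi_install.iso -file /home/padmin/ibmi_install.iso -ro # import the install image as read-only virtual media
mkvdev -fbo -vadapter vhost4                                          # create a file-backed optical device (vtoptN) on the client's vhost adapter
loadopt -vtd vtopt0 -disk ibmi_install.iso                            # "insert" the media so the client LPAR can boot from it
unloadopt -vtd vtopt0                                                 # eject the media once the install is done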

Make an Informed Decision

Of course, one size doesn’t fit all and there are plenty of great reasons to exclude VIOS from your environment. Maybe you don’t have the need for multiple workloads or virtualization on Power hardware. Maybe you don’t have a SAN in your environment and don’t see one coming any time soon. But don’t let fear of the unknown or memories of the way things once were steer your current thinking around virtualization. Make yourself aware of the pros and cons, and make an informed decision.

Seamless Transitions

Edit: Have you upgraded yet?

The upgrade to AIX 7 is hassle free and benefit rich

Originally posted September 2010 by IBM Systems Magazine

Last month, IBM announced AIX* 7 and its general availability date of Sept. 10. Companies with current IBM software-maintenance agreements receive this upgrade at no charge, meaning adoption should be swift. Technologists in your company will likely be eager to schedule operating-system upgrades and start using the new features. Because of the open beta, many are already testing it. See “Open Beta” for more details.

[The AIX* 7 open beta program, where you could freely download and test the operating system, has been ongoing this summer, and downloads are scheduled to continue through October. Many of your AIX administrators and IT staff have already downloaded the AIX 7 images and have begun testing the new operating system.

You can install the open beta onto your POWER4* or better hardware, but you can’t take that open beta installation and then upgrade or migrate it. You’ll need to do a fresh reinstallation of AIX 7 after it’s generally available. The open beta is meant for test systems and becoming familiar with the operating system and its new features, not for production workloads. Assume that everything you do on this test machine will need to be redone after installing from the official release media.

—R.M.]

Why 7?

Take some care when calling AIX 7 a new version of the operating system; it’s really more of an evolution or continuation of AIX 6. The upgrade from AIX 5.3 to AIX 6 was considerably more extensive than the change from AIX 6 to AIX 7, which might be considered a fine-tuning. For instance, the default parameters of a fresh AIX 6 installation make far more sense out of the box than those of AIX 5.3, which needed tuning and tweaking, and AIX 7 continues that approach with defaults that work for the majority of customers.

Some people wanted to call this new release AIX 6.2, but IBM went with AIX 7 in part because of the POWER7* hardware releases. Don’t let the name make you worry about switching in your environment. According to IBM Marketing Manager Jay Kruemcke, “If you’ve been waiting to upgrade, now’s a good time to do so.” Kruemcke points to the binary-compatibility guarantee—where IBM states: “Your applications, whether written in house or supplied by an application provider, will run on AIX 7 if they currently run on AIX 6 or AIX 5L—without recompilations or modification”—and IBM’s great history of binary compatibility throughout the years.

Most IT staff will make time in their busy schedules to test new versions of operating systems as soon as they can. With open beta, they may have already reported the results of their testing and be making the case for moving to AIX 7 now. The case is strong.

If you’ve been waiting to upgrade, now’s a good time to do so. —Jay Kruemcke, IBM marketing manager

The Power of POWER

As you consider AIX 7, it’s important to know what version of POWER* systems (or older RS/6000* systems) and the operating system your company is running. Many of you may be surprised to discover that you’re running AIX 5.2 on older hardware. This version of the AIX operating system was withdrawn from marketing in July 2008, but, for whatever reason, some companies still need it running in their environments. This old machine is typically hosting an application that can’t be upgraded—or may not be worth the effort to upgrade—and it’s typically running on older, slower, less energy-efficient, nonvirtualized hardware. “When that’s the case, you’re missing out on great performance enhancements, new features and cost savings,” Kruemcke says.

Although AIX 7 can run on POWER4* or later hardware, consider running it on POWER7 hardware. A huge benefit of AIX 7 running on POWER7 hardware is the capability to collect those older AIX 5.2 operating-system images, take a system backup (mksysb), and install that AIX 5.2 backup image without modification into an AIX 7 workload partition (WPAR). Once your mksysb image has been created and moved to your POWER7 system, you can give a flag to the WPAR creation command (mkwpar) and restore that backup image into a WPAR running inside AIX 7. Since these AIX 5.2 WPARs will run on top of AIX 7, you’ll also benefit from POWER7’s simultaneous multithreading with four threads and greater performance. This is an excellent way to consolidate old workloads running on less-efficient hardware.
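
As a rough sketch of what that restore looks like, the versioned WPAR support in AIX 7 (delivered as a separately installable product) lets you point mkwpar directly at the old backup. The image path and WPAR name below are placeholders.

mkwpar -n aix52wpar -C -B /backups/aix52_server.mksysb   # create a versioned WPAR from an AIX 5.2 mksysb image
startwpar aix52wpar                                       # boot the new workload partition
clogin aix52wpar                                          # log in and verify the restored environment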

You should immediately see improved performance after moving your workload to a POWER7 server from older hardware, and you’ll enjoy all of the benefits of virtualization on new hardware. “Customers who’ve never looked at WPARs before will take a second look,” Kruemcke says.

Moving your AIX 5.2 system to POWER7 hardware makes it part of an LPAR. It can be part of a micropartitioned pool of processors and donate any idle cycles back into the shared processor pool, and it can have its disk and networking virtualized and handled through VIO servers. The whole LPAR can move to another POWER7 machine in your environment using Live Partition Mobility, or just the WPAR itself can move to another POWER7 machine via Live Application Mobility. Your older operating system can now benefit from all of the advantages of the latest technology, without upgrading the operating system and application.

If you choose to run AIX 5.2 in a WPAR, you’ll have access to IBM phone support, and the operating system will have patches available for critical issues. Instead of needing extended IBM support contracts for your AIX 5.2 machines, you can get ongoing support through your regular maintenance contracts.

Why WPARs?

Nigel Griffiths, Power Systems* technical support, IBM Europe, says companies will see a quadruple win with this move: They’ll remove end-of-life slower machines from their environments, do away with the higher electricity costs of those older machines, eliminate the higher hardware-maintenance costs for those older machines, and decrease the data-center footprint of machine and network cabling.

“WPARs have some great advantages over LPARs,” Griffiths says. “WPARs can be created faster than LPARs, LPARs need more memory to boot compared to WPARs, and you can share application code between multiple WPARs compared to having the same application sitting across LPARs, to name a few.”

Although the WPAR adoption rate has been slow so far, Kruemcke says new WPAR capabilities will cause more people to consider them. Besides running AIX 5.2 in a WPAR, you’ll also have support for N_Port ID Virtualization (NPIV) and VIOS storage with WPARs in AIX 7, as the operating system includes support for exporting a virtual or physical Fibre Channel adapter to a WPAR. In the new release, the adapter will be exported to the WPAR in the same manner as storage devices.

If you’re running AIX 5.3 on POWER7 hardware, keep in mind that you’re running in POWER6* compatibility mode and aren’t fully exploiting the new hardware. “Since you can upgrade directly from AIX 5.3 to AIX 7, it makes sense to do that upgrade and enjoy the performance benefits of running AIX 7 in POWER7 mode on POWER7 hardware,” Kruemcke says.

What’s New?

Although not a major change, AIX 7 boasts some nice new features.

1,024 threads. AIX 7 supports a large LPAR running 1,024 threads, compared with 256 threads in AIX 6. This large LPAR contains 256 cores, and each core can run four threads, providing the capability to run 1,024 threads in a single operating-system image. If your business needs a very large machine running a massively scaled workload, this thread boost will be a huge benefit. Even if you don’t think you need the capability, it’s nice to know you can migrate your workload into this large environment if needed.

AIX Profile Manager. Besides the massive scalability and the capability to run AIX 5.2 in a WPAR, AIX 7 also supports the AIX Profile Manager, formerly known as the AIX Runtime Expert. An IBM Systems Director plug-in, AIX Profile Manager provides configuration management across a group of systems. This lets you see your current system values, apply new values across multiple systems and compare values between systems. Configuring and maintaining your machines can be easier, and you can verify that machine settings haven’t changed over time. You can also set up one machine, then copy its properties across multiple systems. These profiles and properties might include environment variables, tunables and security profiles.
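
The command-line side of the Runtime Expert gives a feel for how these profiles work. This is only a sketch; the command names are the real AIX Runtime Expert utilities, but the profile names are examples and the flags may differ by AIX level, so check the documentation for your release.

artexlist                               # list the profile catalogs shipped with AIX
artexget -f txt vmoProfile.xml          # capture the current values described by a profile in readable form
artexdiff profile_a.xml profile_b.xml   # compare two captured profiles, for example from two systems
artexset my_standard_profile.xml        # apply a profile's values to this system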

Systems Director. AIX 7 has also made a change in Web-based System Manager (WebSM), which now integrates with IBM Systems Director and is called the IBM Systems Director Console for AIX. This provides a Web-based management console for AIX so systems administrators have centralized access to do tasks like viewing, monitoring and managing systems. This tool will let staff manage systems using distributed command execution and use familiar interfaces such as the System Management Interface Tool from a central management control point.

Language support. As more companies around the globe deploy AIX 7, they’ll be happy to know that it supports 61 languages and more than 250 locales based on the latest Unicode technology. Unicode 5.2 provides standardized character positions for 107,156 glyphs, and AIX 7 complies with the latest version. This will make the operating system and applications more accessible for non-English speakers.

Updated shell environment. AIX 7 provides a newly updated version of the ksh93 environment. AIX 6 shipped a ksh93 based upon the ksh93e version of the popular shell; AIX 7 updates it to be based upon ksh93t. Users now have access to a variety of enhancements and improvements that the Korn shell community has made over the past several years, resulting in a more robust shell programming experience. Many customers complain about having to learn their way around the Korn shell, and AIX 7 should help: after they run the set -o viraw command, they’ll have access to tab completion and can move through their shell history using the arrow keys instead of vi commands. Users of other shells from other operating systems will have one less thing to learn on AIX.
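
If you want to try it, a minimal sketch of the relevant lines in a user’s $HOME/.kshrc (assuming ENV points at that file) looks like this:

set -o vi       # vi-style command-line editing in ksh93
set -o viraw    # read input a character at a time, enabling tab completion and arrow-key history recall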

Role-based access control. Many companies still rely on sudo to give nonroot users root user functionality. AIX 7 continues supporting role-based access control (RBAC) but enhances it by providing resource isolation. In previous iterations of RBAC, if you gave someone access to change a device, they could change any device of that type. Now you can limit their access to a specific device on the system. This lets you give a nonroot user access to resources that they can manage, and have more granular control over what they can do.
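
For context, here’s a minimal sketch of the basic enhanced-RBAC workflow on AIX; the role name, authorization and user are placeholders, and the finer-grained per-device isolation described above layers on top of this same mechanism.

mkrole authorizations=aix.fs.manage.mount fs_operator   # create a role carrying the mount authorization
chuser roles=fs_operator bob                            # allow user bob to assume the role
setkst                                                  # refresh the kernel security tables so the change takes effect
swrole fs_operator                                      # run as bob: switch into the role to perform the privileged task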

Clustering. Another highlight of this announcement is the clustering technology that’s being built into the operating system. AIX 7 now has built-in kernel-based heartbeats and messages, and multichannel communication between nodes. It also features clusterwide notification of errors and common naming of devices across nodes. This will let multiple machines see the same disk and have it be called the same name. Built-in security and storage commands support operations across the cluster.

You used to have to purchase HACMP or PowerHA* products and install them on top of AIX to get these features, but now much of that functionality is built into the operating system and better integrated with AIX 7. This should make implementing high-availability clusters easier for administrators.
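
For a sense of how little is involved, here’s a minimal sketch of creating a cluster with the new Cluster Aware AIX commands. The cluster name, node names and repository disk are placeholders.

mkcluster -n demo_cluster -m nodeA,nodeB -r hdisk9   # create a two-node cluster with hdisk9 as the shared repository disk
lscluster -m                                         # list the cluster nodes and their state
lscluster -d                                         # show the cluster's view of the shared disks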

Continued Investment

IBM continues to make investments in the Power Systems hardware, AIX and software. You can be sure that IBM will continue to stand behind the investments you’re making well into the future. So take AIX 7 for a spin and enjoy the new features.

Those Who Do Without Virtualization

Edit: Still a good topic.

Originally posted November 30, 2010 by IBM Systems Magazine

Working on virtualized systems as much as I do, and talking to people about virtualization as often as I do, I tend to forget a couple things:

  1. Not all IBM Power Systems users have virtualized systems.
  2. Not all of them use VIOS even while they benefit from other aspects of virtualizing their machines.

It isn’t necessarily that these shops are limited by the constraints of older hardware and operating systems. I know of customers with POWER6 and POWER7 hardware that haven’t yet virtualized their systems. Maybe they lack the time or the resources to virtualize more fully, or maybe they simply lack the skills that come only with hands-on experience.

Customers who aren’t hands-on generally don’t realize that virtualization covers a wide range of functionality. Using workload partitions (WPARs) counts as virtualization. Micropartitioning CPU, where we assign fractions of a CPU to an LPAR and then set up processing entitlements and cap or uncap partitions based on our LPAR’s requirements? That’s virtualization. We use VIOS to virtualize disk, the network or both. NPIV allows us to virtualize our Fibre Channel adapters and have our clients recognize the LUNs we provision, and it saves us the effort of mapping them to the VIOS and then remapping them to the VIOS client LPARs. We use the built-in LHEA to virtualize the network. We could create an LPAR with some dedicated physical adapters and some virtual adapters. We could use Active Memory Sharing and Active Memory Expansion to better utilize our systems’ memory. Power Systems offers many choices and scenarios where it can be said that we’re using virtualized machines.
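
To make the NPIV piece concrete, here’s a minimal sketch of mapping a virtual Fibre Channel server adapter to a physical port on the VIO server; the device names are examples.

lsnports                                 # list physical Fibre Channel ports and whether they support NPIV
vfcmap -vadapter vfchost0 -fcp fcs0      # map the virtual FC server adapter vfchost0 to physical port fcs0
lsmap -npiv -vadapter vfchost0           # confirm the mapping and see the client's virtual WWPNs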

I know some administrators who’ve been unable to convince their management or application vendors of virtualization’s benefits. I know of some IBM i users who are reluctant to get on board with VIOS (though plenty of AIX shops still don’t virtualize, either). Sometimes it’s the vendor that lacks the time, resources or skills for virtualization. For instance, I’ve seen multiple customer sites where tons of I/O drawers are used; the vendor won’t officially support VIOS because the vendor hasn’t tested it, and these customers don’t want to run an unsupported configuration.

I talked to an admin who has experience with configuring logical partitions and setting up dedicated CPUs and dedicated I/O slots in his environment, but he continues to use a dynamic logical partition (DLPAR) operation to move a physical DVD between his different LPARs. It’s the way he’s always done it. He figures that his shop not using virtualization is no big deal, since he has no experience with VIOS and virtual optical media anyway. “You can’t miss what you’ve never had,” is how he put it.

Others tell me they see the writing on the wall. They insist they’ll virtualize, someday.

Are there roadblocks keeping you from virtualizing? Are there complications that prevent you from moving to a fully virtualized environment? I’d like to hear about the challenges you face. Please e-mail me or post in Comments.

The Evolution of Education

Edit: Link no longer works.

Originally posted June 29, 2010 by IBM Systems Magazine

As more companies migrate to IBM Power Systems hardware, the need for education grows. It may be hard for us long-time users to imagine, but every day, seasoned pros are just getting started on POWER hardware.

While I’ve provided customer training, what I do–either through giving lectures on current topics or talking to people informally as their systems get built–doesn’t compare to the educational value of a “traditional” instructor-led class or lab.

With that in mind, check into the IBM Power Systems Test Drive, a series of no-charge remote (read: online) instructor-led classes.

Courses being offered include:

IBM DB2 WebQuery for IBM i (AT91)
IBM PowerHA SystemMirror for IBM AIX (AT92)
IBM PowerHA and Availability Resiliency without Downtime for IBM i (AT93)
Virtualization on IBM Power (AT94)
IBM Systems Director 6.1 for Power Systems (AT95)
IBM i on IBM Power Systems (AT96)
IBM AIX on IBM Power Systems (AT97)

Remote training, of course, saves IT pros and their employers the time and expense of having to travel to an educational opportunity. But is something lost if students, instructor and equipment aren’t in the same room? Not necessarily. Let’s face it: Nowadays a lot of education is remote anyway–when you travel to classes and conferences and do lab exercises, you’re likely logging into machines that are located offsite. By now good bandwidth is the norm, so network capacity shouldn’t be an issue when it comes to training.

Sure, offsite training has its advantages. When you travel somewhere for a class, there are fewer distractions, so you can concentrate on the training. Taking training remotely from your office desk, it’s easy to be sidetracked by your day-to-day responsibilities. (This does cut both ways though–I often see people connect to their employer and work on their laptops during offsite training.)

Offsite training also allows you to meet and network with your peers. I still keep in touch with folks I’ve met at training sessions. If I run into a problem with a machine I’m working on, I have any number of people I can contact for help. Being able to tap into that knowledge with just a call or a text message is invaluable.

While I haven’t taken a remote instructor-led class like the ones IBM offers, I’ve heard positive feedback from those who have. But what about you? I encourage you to post your thoughts on training and education in comments.

The Importance of the Academic Initiative

Edit: Some links no longer work.

Originally posted May 18, 2009 by IBM Systems Magazine

In a previous blog entry titled, “Some New Virtual Disk Techniques,” I said that I usually learn something new whenever I attend or download the Central Region Virtual User Group meetings from developerWorks.

For instance, at the most recent meeting, Janel Barfield gave a typically excellent presentation on Power Systems micro-partitioning. But for this post I want to focus on the IBM Academic Initiative. IBMer Linda Grigoleit took a few minutes to cover material about the program, which is available to high school and university faculty.

From IBM:

“Who can join? Faculty members and research professionals at accredited institutions of learning and qualifying members of standards organizations, all over the globe. Membership is granted on an individual basis. There is no limit on the number of members from an institution that can join.”

Check out the downloadable AIX and IBM i courses and imagine a high school or college student taking these classes. With this freely available education, these students would be well on their way to walking in the door of an organization and being productive team members from the beginning of their employment. Think about the head start you would have had if you’d been able to study these Power Systems AIX or IBM i course topics at that age.

Although, as I said in a previous AIXchange entry titled “You Have to Start Somewhere,” I like the idea of employees starting out in operations or on help desks, the Academic Initiative is a great way for people to get real-world skills on real operating systems.

Instructors also benefit from the program, as IBM offers them discounts on certification tests, training and either discounted hardware or free remote access to the Power System Connection Center.

There’s more. From IBM:

“The Academic Initiative Power Systems team provides vouchers for many IBM instructor-led courses to Academic Initiative members at no cost.

“The IBM Academic Initiative hosts an annual Summer school event for instructors. Each summer this very popular event features topics for those new to the IBM i platform.”

Maybe it’s time you get involved. Go to your local high school or university. Find the instructors who would be interested in learning and teaching this technology. Get them to sign up with the Academic Initiative and get involved. With your skills and experience, you could help them get started, and your ongoing assistance would be appreciated by instructors and students alike.

AIX and i Worlds Can Learn from Each Other

Edit: Link no longer works.

Originally posted February 24, 2009 by IBM Systems Magazine

I recently read this iDevelop blog post and it got me thinking. I too have been involved in these discussions with a local IBM i user group that recently had a conference planned. The group was forced to cancel the event due to lack of attendance. Was that due to fewer and fewer actual users of the platform? Was it due to budget or time constraints, where people just couldn’t spend a day away from the office? Or had people lost their jobs because their companies went out of business? The conference planners aren’t sure. All they know is that they wanted to attract enough bodies to their event to cover costs, so they thought that a combined i and AIX conference would be a good thing.

By combining their conference, they had hoped to introduce i people to AIX and Linux. They planned to offer some introductory level tracks so that IBM i people could learn more about AIX. At the same time, introductory tracks were planned to give AIX administrators a better understanding of the benefits of IBM i. But besides the intro classes, power-user sessions were planned, aimed at the serious administrators from both camps.

I was at Virtual I/O Server (VIOS) training last year that was aimed toward users of IBM i, and it seemed to me that this group didn’t want to hear the message that was being delivered. Instead of trying to understand how IBM i using VIOS attached to external storage might be worth considering, they focused on how this would be a different way of doing things and seemed resistant to learning about it.

I also attended an IBM event that featured technical lectures for both i and AIX users, and I watched IBM i users walk out, because they said the event was too slanted toward AIX.

I can certainly agree with the points that the authors of the iDevelop blog make, where you might think that people are watering down content or leaving out sessions in order to accommodate both groups. However, combining events like this might also be an advantage to the attendees. Many shops run IBM i, but they are also running HP servers, Sun servers–some flavor of UNIX. This means that besides the investment in IBM i, these shops are also investing in other vendors’ solutions.

Instead of using all of this different hardware from all of these different vendors, why not consolidate and virtualize the Power Systems server running IBM i in a partition and some number of AIX LPARs in other partitions? While this seems pretty straightforward to someone with an AIX background because we think nothing of running different operating systems and different versions of the same operating system on the same frame, some IBM i people might not realize that this is possible, or what the benefits might be.

There might be discussions in some organizations about eliminating that IBM i machine that just sits in the corner and runs, and taking that workload and running it on Windows or Linux or some flavor of UNIX. If all you understand is IBM i, it might be difficult to articulate its pros and cons versus the other operating systems. There can be a perception that IBM i is still a green-screen 1988 legacy system, instead of a powerful integrated operating system that frankly could use better marketing and education so that more organizations were made aware of its benefits.

If IBM i administrators aren’t keeping up on the trends in the UNIX space, they might be missing a great opportunity to help extend the longevity of their IBM i investments, both in hardware and knowledge. By running more of their company’s workloads on the same hardware from the same vendor, they are now benefiting from having “one throat to choke” if things go wrong, but better than that, they are running the best server hardware currently available.

The problem is, without understanding the basics of AIX and VIOS, and why it can all coexist happily on the same hardware, IBM i administrators might have a difficult time making the case to their management team that this server consolidation could be the way to go.

The IBM technical university that was held in Chicago last fall was a great example of how this can be done–hold tracks that appeal to IBM i administrators and those that appeal to traditional AIX administrators. Let attendees freely move between tracks so that they can learn more about the “other side.” Although they won’t become experts after a few sessions, they should at least start to understand the lingo, the jargon and the benefits that come from the other operating system. AIX administrators might be surprised to learn just how good IBM i is, while i administrators might also be pleasantly surprised to learn just how good AIX is.

Change can be scary, change can be hard, but change will come. How will we deal with it? Will we try to keep our traditional user groups doing the same old thing or will we try to learn more about other technologies? By telling the IBM i story to AIX administrators, at a minimum there will be more people out there that understand the basics of why it is so good, and who might be eager to make the case to management that consolidation might make sense.

An LPAR Review

Edit: Some links no longer work.

Originally posted September 2009 by IBM Systems Magazine

To learn more about this topic, read these articles:
Software License Core Counting
Trusted Logging Simplifies Security
Tools You Can Use: Planning and Memory
Improve Power Systems Server Performance With Virtual Processor Folding
Now’s the Time to Consider Live Partition Mobility
Improve Power Systems Server Performance With Enhanced Tools
How to Use rPerfs for Workload Migration and Server Consolidation
Entitlements and VPs- Why You Should Care
Three Lesser-Known PowerVM Features Deliver Uncommon Benefits

In 2006 IBMer Charlie Cler wrote a great article that helps clear up confusion regarding logical, virtual and physical CPUs on Power Systems (“Configuring Processor Resources for System p5 Shared-Processor Pool Micro-Partitions”). This subject still seems to be a difficult concept for some people to grasp, particularly those who are new to the platform or are unfamiliar with the topic. But if you put in the research, there are a lot of quality resources available.

I recently saw Charlie give a presentation to a customer where he covered this topic again, and I based this article on the information that he gave us that day, with his permission.

When you’re setting up LPARs on a hardware management console (HMC), you can choose to have dedicated CPUs for your LPAR, which means an LPAR exclusively uses a CPU; it isn’t sharing CPU cycles with any other LPAR on the frame. On POWER6 processor-based servers you can elect to have shared dedicated processors, where the system allows excess processor cycles from a dedicated-processor LPAR to be donated to the shared processor pool.

Instead of using dedicated or shared dedicated CPUs, you could choose to let your LPAR take advantage of being part of a shared pool of CPUs. An LPAR operates in three modes when it uses a shared pool: guaranteed, borrowing and donating. When your LPAR is using its entitled capacity, it isn’t donating or borrowing from the shared pool. If it’s borrowing from the pool, then it’s going over its entitled capacity and using spare cycles another LPAR isn’t using. If the LPAR is donating, then it isn’t using all of its entitlement, but returning its cycles to the pool for other LPARs to use.

In his presentation, Cler shared some excellent learning points that I find useful:

  • The shared processor pool automatically uses all activated, non-dedicated cores. This means any capacity upgrade-on-demand CPUs that were physically installed in the frame but not activated wouldn’t be part of the pool. However, if a processor were marked as bad and removed from the pool, the machine would automatically activate one of the deactivated CPUs and add it to the pool.
  • The shared processor-pool size can change dynamically as dedicated LPARs start and stop. As you start more and more dedicated LPARs on your machine, the number of CPUs available to the pool decreases. Conversely, as you shut down dedicated LPARs, more CPUs become available to the pool.
  • Each virtual processor can represent 0.1 to 1 of a physical processor. For any given number of virtual processors (V), the range of processing units that the LPAR can utilize is 0.1 * V to V. So for one virtual processor, the range is 0.1 to 1, and for three virtual processors, it’s 0.3 to 3.
  • The number of virtual processors specified for an LPAR represents the maximum number of physical processors the LPAR can access. If your pool has 32 processors in it, but your LPAR only has four virtual CPUs and it’s uncapped, the most it’ll consume will be four CPUs.
  • You won’t share pooled processors until the number of virtual processors exceeds the size of the shared pool. If you have a pool with four CPUs and two LPARs, each with two virtual CPUs, there would be no need to share CPUs. As you add more LPARs and virtual CPUs to the shared pool, eventually you’ll have more virtual processors than physical processors. This is when borrowing and donating cycles based on LPAR activity comes into play.
  • One processing unit is equivalent to one core’s worth of compute cycles.
  • The specified processing units are guaranteed to each LPAR no matter how busy the shared pool is.
  • The sum total of assigned processing units cannot exceed the size of the shared pool. This means you can never guarantee to deliver more than you have available; you can’t guarantee four CPUs worth of processing power if you only have three CPUs available.
  • Capped LPARs are limited to their processing-unit setting and can’t access extra cycles.
  • Uncapped LPARs have a weight factor, which is a share-based mechanism for the distribution of excess processor cycles. The higher the number, the better the chances the LPAR will get spare cycles; the lower the number, the less likely the LPAR will get spare cycles.

When you’re in the HMC and select the desired processing units, it establishes a guaranteed amount of processor cycles for each LPAR. When you set it to “Uncapped = Yes,” an LPAR can utilize excess cycles. If you set it to “Uncapped = No,” an LPAR is limited to the desired processing units. When you select your desired virtual processors, you establish an upper limit for an LPAR’s possible processor consumption.

Charlie gives an example of an LPAR with two virtual processors. This means the assigned processing units must be somewhere between 0.2 and 2. The maximum processing units the LPAR can utilize is two. If you want this LPAR to use more than two processing units worth of cycles, you need to add more virtual processors. If you add two more, then the assigned processing units must now be at least 0.4 and the maximum utilization is four processing units.

You need to consider peak processing requirements and the job stream (single or multi-threaded) when setting the desired number of virtual processors for your LPAR. If you have an LPAR with four virtual processors and a desired 1.6 processing units–and all four virtual processors have work to perform–each receives 0.4 processing units. The maximum processing units available to handle peak workload is four. Individual processes or threads may run slower, while workloads with a lot of processes or threads may run faster.

Compare that with the same LPAR that now has only two virtual processors instead of four, but still has a desired 1.6 processing units. If both virtual processors have work to be done, each will receive 0.8 processing units. The maximum processing units possible to handle peak workload is two. This time, individual processes or threads may run faster, while workloads with a lot of processes or threads may run slower.

If there are excess processing units, LPARs with a higher desired virtual-processor count are able to access more excess processing units. Think of a sample LPAR with four virtual processors, desired 1.6 processing units and 5.8 processing units available in the shared pool. In this case, each virtual processor will receive 1.0 processing units from the 5.8 available. The maximum number of processing units that can be consumed is four, because there are four virtual processors. If the LPAR only has two virtual processors, each virtual processor will receive 1.0 processing units from the 5.8 available, and the maximum processing units that can be consumed is two, because we only have two virtual processors.

The minimum and maximum settings in the HMC have nothing to do with resource allocation during normal operation. Minimums and maximums are limits applied only when making a dynamic change to processing units or virtual processors using the HMC. The minimum setting also allows an LPAR to start with less than the desired resource allocations.

Another topic of importance Cler covered in his presentation is simultaneous multi-threading (SMT). According to the IBM Redbooks publication “AIX 5L Performance Tools Handbook” (TIPS0434, http://www.redbooks.ibm.com/abstracts/tips0434.html?Open), “In simultaneous multi-threading (SMT), the processor fetches instructions from more than one thread. The basic concept of SMT is that no single process uses all processor execution units at the same time. The CPU design implements two-way SMT on each of the chip’s processor cores. Thus, each physical processor core is represented by two virtual processors.” Basically, one processor, either dedicated or virtual, will appear as two logical processors to the OS.

If SMT is on, AIX will dispatch two threads per processor. To the OS, it’s like doubling the number of processors. When “SMT = On,” logical processors are present, but when “SMT = Off,” there are no logical processors. SMT doesn’t improve system throughput on a lightly loaded system, and it doesn’t make a single thread run faster. However, SMT does improve system throughput on a heavily loaded system.
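
Checking and changing SMT from a running AIX LPAR is a one-command affair. A minimal sketch follows; the -t option for choosing the number of threads assumes a POWER7 system with a recent AIX level.

smtctl                   # show the current SMT mode and the logical CPUs behind each processor
smtctl -m off -w now     # turn SMT off immediately for the current boot
smtctl -m on -w boot     # turn SMT back on starting with the next reboot
smtctl -t 4              # on POWER7, request four hardware threads per core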

Consider a sample LPAR in a 16-CPU shared pool with SMT on, 1.2 processing units, three virtual processors and six logical processors: the LPAR is guaranteed 1.2 processing units at all times. If the LPAR isn’t busy, it will cede unused processing units to the shared pool. If the LPAR is busy, you could set it to capped, which would limit it to 1.2 processing units; alternatively, uncapped would allow the LPAR to use up to three processing units, since it has three virtual processors.

To change the range of spare processing units that can be utilized, use the HMC to change desired virtual processors to a new value between the minimum and maximum settings. To change the guaranteed processing units, use the HMC to change desired processing units to a new value between the minimum and maximum settings.
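
From the HMC command line, those same dynamic changes look roughly like this; the managed-system and partition names are placeholders, and the HMC GUI accomplishes the same thing.

lshwres -r proc -m Server-9117 --level lpar                     # show processor settings for each LPAR on the managed system
chhwres -r proc -m Server-9117 -o a -p mylpar --procunits 0.3   # dynamically add 0.3 processing units to mylpar
chhwres -r proc -m Server-9117 -o a -p mylpar --procs 1         # dynamically add one virtual processor
chhwres -r proc -m Server-9117 -o r -p mylpar --procs 1         # remove a virtual processor again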

When you think about processors, you need to think P-V-L (physical, virtual, logical). The physical CPUs are the hardware on the frame. The virtual CPUs are set up in the HMC when we decide how many virtual CPUs to give to an LPAR. The logical CPUs are visible and enabled when we turn on SMT.

When configuring an LPAR, Cler recommends setting the desired processing units to cover a major portion of the workload, then set desired virtual processors to match the peak workload. LPAR-CPU utilization greater than 100 percent is a good thing in a shared pool, as you’re using spare cycles. When you measure utilization, do it at the frame level so you can see what all of the LPARs are doing.
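
A couple of standard AIX tools make that kind of monitoring straightforward; a minimal sketch:

lparstat -i     # show this LPAR's entitlement, virtual processor count, mode and pool size
lparstat 5 3    # sample CPU use every 5 seconds, 3 times; the %entc column shows percent of entitlement consumed
topas -C        # cross-partition view, so you can watch every LPAR on the frame at once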

There’s a great deal to understand when it comes to Power Systems and the flexibility that you have when you set up LPARs. Without a clear understanding of how things relate to each other, it’s very easy to set things up incorrectly, which might result in performance that doesn’t meet your expectations. However, by using dynamic logical-partitioning operations, it can be easy to make changes to running LPARs, assuming you have good minimum and maximum values. As one of my colleagues says, “These machines are very forgiving, as long as we take a little care when we initially set them up.”

Other Resources

IBM developerWorks
Virtualization Concepts

IBM Redbooks publications
“PowerVM Virtualization on IBM System p: Introduction and Configuration Fourth Edition” (SG24-7940-03)

“IBM PowerVM Virtualization Managing and Monitoring” (SG24-7590)

IBM Systems Magazine articles
Mapping Virtualized Systems

Shared-Processor Pools Enable Consolidation

My Love Affair with IBM i and AIX

Edit: Another good one.

Originally posted December 1, 2008 by IBM Systems Magazine

I started my IT career in 1988 as a computer operator, specializing in AS/400 servers running OS/400. It was love at first sight. The commands made sense, I learned to love WRKACTJOB and WRKJOBQ and QPRINT and QBATCH. I’d happily vary on users who’d tried to log onto their green-screen terminal too many times with the wrong password. I’d configure my alphanumeric pager to send me operator messages that were waiting for a reply. Printing reports on green bar paper and changing to different print forms became an art. You had to know where to line up the paper, and when to hit G on the console. You had to manage the backup tapes and the backup jobs. For the most part, interactive response time took care of itself, although occasionally I’d have to hold a batch job during the day if things weren’t running smoothly.

You can call it i5/OS, you can call it IBM i, you can call it whatever you want, but to me it will always be OS/400. Among many people I talk to, the feeling is the same. Name it what you want, just keep supporting and selling it, because we love it.

I worked for three different companies doing AS/400 computer operations, and the OS/400 learning curve wasn’t very steep whenever I made a change. The simplicity and the elegance of the interface was the same. The computers just worked. Sure, the machines ran different applications since the companies were in different industries. In some cases, they were in different countries. The green screens looked the same no matter where I worked. I can remember hardware issues where we would lose a 9336 disk, replace it, and the machine would keep on running. I can remember human error causing issues; however, I don’t remember the operating system locking up like others have been known to do. I can’t remember wishing I were running something else. OS/400 was and is a rock-solid platform on which to run a business.

My head was turned in 1998, and I left my first love and started my affair with AIX. I traded QSECOFR for root. There’s much to be said for AIX and open systems. I also like the way things are structured in this operating system. It can seem familiar to people with Solaris or Linux skills, although there will be new things to learn, like the Object Data Manager (ODM) and Systems Management Interface Tool (smitty). A friend likes to dismiss AIX by calling it “playing with tinker toys.” I can connect the operating system to all kinds of disk subsystems from all kinds of manufacturers. I can use third-party equipment to manage my remote terminal connections if I want to. I can run all kinds of applications from all kinds of vendors. Since it lives in the UNIX world, its heritage is considered to be more open and less proprietary, although I’m sure that open-source adherents and members of the Free Software Foundation would argue that point.

I’ve become accustomed to things taking a certain amount of tinkering to get them to work. I know that I may have to load some drivers, or configure a file in the /etc directory to tell a program how to behave. I have to pay attention to disk consumption, file system sizes, volume groups, etc. I accept all of that as part of the whole package. Some from the IBM i world hear about this and shake their heads and wonder why anyone would put up with it.

Now that POWER servers have been consolidated and AIX and IBM i will run on the same machine, it makes sense to see what can be shared. How can we take our current AIX and IBM i environments and run them all on the same physical frame? During this exploration, I’ve been hearing a great deal of resistance from the i community. Part of this might be a natural response to any kind of change. Change can be scary and painful. However, since I’ve spent a bit of time in both the AIX and IBM i worlds, I think I can safely say it shouldn’t be scary and it definitely isn’t painful. It’s just another set of commands to learn, but once you learn them, it’s just like anything else in IT.

I’ve begun playing with IBM i again and it’s like I never left. I’ve written an article on implementing IBM i using Virtual I/O Servers (VIOS). If you’re an IBM i administrator, the idea of running IBM i as a client of VIOS might sound intimidating, but it’s not. In years past, IBM i has hosted AIX and Linux partitions. Using VIOS is the exact same concept, only instead of your underlying operating system being IBM i based, it’s VIOS, which is AIX based. If you want to know why you should bother, check out “Running IBM i and AIX in the Same Physical Frame,” then let me know what you think. I still look back fondly at my first true love, and I’m glad it’s still being well positioned for the future.

If you’re an AIX administrator, offer help to IBM i administrators who might be nervous about running VIOS to connect to external disk. In some larger shops, these teams might not spend much time together, but it’s time to change that mentality.

Run IBM i and AIX in the Same Physical Frame

Edit: Some links no longer work.

POWER technology-based servers allow for consolidation

Originally posted December 2008 by IBM Systems Magazine

As I wrote in my blog titled “My Love Affair with IBM i and AIX”, I started my career working on AS/400 servers running OS/400 – and I loved it. Then I started working on AIX – and I loved that. AIX has been my world for the past decade.

During that time, AIX customers began using Virtual I/O Servers (VIOS) to consolidate the number of adapters needed on a machine. Instead of 10 standalone AIX servers, they would have one or two frames and use pools of shared processors. But all of these LPARs still needed dedicated hardware adapters, so they consolidated again to share those resources. This required a VIOS with shared network adapters and shared Fibre Channel adapters, which reduced the number of physical adapters required.

Now that POWER technology-based servers have been consolidated and AIX and IBM i run on the same machine, it makes sense to see what else can be shared. How can we take our current AIX and IBM i environments and run them on the same physical frame?

The IBM Technical University held this fall in Chicago offered sessions for AIX customers and for IBM i customers. If you were like me, you bounced between them. Great classes went over the pros and cons of situations where IBM i using VIOS as a solution may make sense. Although the idea of running IBM i as a client of VIOS might sound intimidating, it’s not. In years past, IBM i has hosted AIX and Linux partitions. Using VIOS is the same concept, only instead of your underlying operating system being IBM i-based, it’s VIOS, which is AIX-based.

Great documentation has been created to help us understand how to implement IBM i on VIOS. Some is written more specifically for those running IBM i on blades, but it’s applicable whether you’re on a blade or another Power Systems server. Many shops already have AIX skills in house, but if you don’t, it can be very cost-effective to hire a consultant to do your VIOS installation. Many customers already bring in consultants when they upgrade or install new hardware, so setting up VIOS can be something to add to the checklist. You can also opt to have IBM manufacturing preinstall VIOS on Power blades or Power servers.

Answering the Whys

Why would you want to use VIOS to host your disk in the first place? VIOS is able to see more disk subsystems than IBM i can see natively. As of Nov. 21, the IBM DS3400, DS4700, DS4800, DS8100 and DS8300 are all supported when running IBM i and VIOS, and I expect the number of supported disk subsystems to increase. You can also use a SAN Volume Controller (SVC) with VIOS, which lets you put many more storage subsystems behind it–including disk from IBM, EMC, Hitachi, Sun, HP, NetApp and more. This way you can leverage your existing storage-area network (SAN) environment and let IBM i connect to your SAN.

The question remains, why bother with VIOS in the first place? These open-system disk units are expecting to use 512 bytes per sector, while traditional IBM i disk units use 520 bytes per sector. By using VIOS, you’re presenting virtual SCSI disks to your client LPARs (vtscsi devices) that are 512 bytes per sector. IBM i’s virtual I/O driver can use 512 bytes per sector, while none of the current Fibre Channel, SAS or SCSI drivers for physical I/O adapters can (for now). IBM i storage management will expect to see 520 bytes per sector. To get around that, IBM i uses an extra sector for every 4 K memory page. The actual physical disk I/O is being handled by VIOS, which can talk 512 bytes per sector. This, in turn, allows you to widen the supported number of disk subsystems IBM i can use without forcing the disk subsystems to support 520 bytes per sector.
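
The arithmetic behind that extra sector is simple, and it’s why an IBM i client of VIOS sees roughly 8/9 of the raw capacity of a 512-byte-per-sector LUN. This is a back-of-the-envelope sketch, not an official capacity formula:

8 sectors x 512 bytes = 4,096 bytes of page data
8 sectors x 8 bytes   = 64 bytes of header data that would have lived in 520-byte sectors
64 bytes              = rounds up to 1 additional 512-byte sector
Result: 9 sectors on disk per 4 K page instead of 8, or about 8/9 of the LUN usable for data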

But again, why bother? It’s certainly possible you don’t need to implement this in your environment. If things are running fine, this makes no sense for you. This solution is another tool in the toolbox, and another method you can use to talk to disk. As alternative solutions are discussed in your business, and people are weighing the pros and cons of each, it’s good to know VIOS is an option.

Do you currently have a SAN or are you looking at one? Are you thinking about consolidating storage for the other servers in your environment? Are you considering blade technology? Are you interested in running your Windows, VMware, Linux, AIX and IBM i servers in the same BladeCenter chassis? If you have an existing SAN, or you’re thinking of getting one, it may make sense to connect your IBM i server to it. If you’re thinking of running IBM i on a blade, then you most certainly have to look at a SAN solution. These are all important ideas to consider, and you may find significant savings when you implement these new technologies.

VIOS

When I was first learning about VIOS, a friend of mine said this was the command I needed to share a disk in VIOS:

mkvdev -vdev hdisk1 -vadapter vhost1

When you think of an IBM i command, mkvdev (or make virtual device) makes perfect sense. I find that to be true of many AIX commands. You give the command the disk name (in this case an hdisk known to the machine as hdisk1) and the adapter to connect it to. On the IBM i client partition, a disk will appear that’s available for use just like any other disk.

To take it from the beginning, you’d have already set up your server and client virtual adapters, and your SAN administrator would zone the disks to your VIOS physical Fibre adapters. You’d log into VIOS as padmin, and after you run cfgdev in VIOS to make your new disks available, you can run lspv (list physical volume) and see a list of disks attached to VIOS.

In my case I see:

lspv
NAME       PVID              VG      STATUS
hdisk0     0000bb8a6b216a5d  rootvg  active
hdisk1     00004daa45e9f5d1  None
hdisk2     00004daa45ebbd54  None
hdisk3     00004daa45ffe3fd  None
hdisk4     00004daa45ffe58b  None
hdisk5     00004daae6192722  None
This might look like only one disk, hdisk0 in rootvg, is in use. However, if I run lsmap -vadapter vhost3 (lsmap could be thought of as list map, with the option asking it to show me the virtual adapter called vhost3), I’ll see:

SVSA             Physloc                       Client Partition ID
---------------  ----------------------------  -------------------
vhost3           U7998.61X.100BB8A-V1-C17      0x00000004

VTD              vtscsi3
Status           Available
LUN              0x8100000000000000
Backing device   hdisk5
Physloc          U78A5.001.WIH0A68-P1-C6-T2-W5005076801202FFF-L9000000000000

This tells me that hdisk5 is the backing device, and it’s mapped to vhost3, which in turn is mapped to client partition 4, which is the partition running IBM i on my machine.

To make this mapping, I needed to run the mkvdev command:

mkvdev -vdev hdisk5 -vadapter vhost3

If I needed to assign more disks to the partition, I could’ve run more mkvdev commands. At this point, I use the disks just as I would any other disks in IBM i.
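
If you do map several disks, it can be worth naming the virtual target devices yourself so they’re easy to identify and remove later. A small sketch, with made-up device names:

mkvdev -vdev hdisk2 -vadapter vhost3 -dev ibmi_ld2   # map hdisk2 to the same vhost, naming the VTD ibmi_ld2
mkvdev -vdev hdisk3 -vadapter vhost3 -dev ibmi_ld3   # and another
lsmap -vadapter vhost3                               # confirm all of the backing devices now behind vhost3
rmvdev -vtd ibmi_ld3                                 # remove a mapping by its VTD name if you need to back it out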

It might look like gibberish if this is your first exposure to VIOS. Your first inclination may be to avoid learning about it. Don’t dismiss it too quickly. IBM i now has another option when you’re setting up disk subsystems. The more you know about how it works, the better you’ll be able to discuss it.

Although I may find myself more heavily involved with AIX and VIOS, I still look back fondly at my first true love, and I’m glad it’s still getting options added that position it well for the future.

References

www.ibm.com/systems/resources/systems_power_hardware_blades_i_on_blade_readme.pdf

www.ibm.com/systems/resources/systems_i_os_i_virtualization_and_ds4000_readme.pdf

www.redbooks.ibm.com/abstracts/sg246455.html

www.redbooks.ibm.com/abstracts/sg246388.html

www.ibm.com/systems/storage/software/virtualization/svc

Data Protection Versus Risk

Edit: I remember writing this in the airport. Some links no longer work.

Originally posted August 2008 by IBM Systems Magazine

I just took off my shoes, took out my laptop, removed the liquids from my carry-on bags and the metal from my pockets. I walked through a metal detector and my belongings filed through the X-ray machine. This was all done in the name of airport security.

It was inconvenient. It took time to navigate through the lines, show my documents and eventually clear security. It took money to implement and maintain all of the systems in place. However, the inconvenience, lost time and money spent was offset by the idea of keeping attackers out of the system. In fact, the knowledge that these defenses were in place potentially kept many threats at bay.

After all of this, the secure area could still be attacked by determined individuals and organizations. A trusted employee could cause harm. Someone who passed a background check could then turn around and do harm. X-rays and metal detectors are deterrents, but they can’t guarantee that nothing bad will ever happen.

As we all prefer the convenience and time savings we gain when we fly, decisions are made as to acceptable risk and potential inconvenience to travelers. People learn the new security rules and follow them. Airport security and server security involve protecting different things, yet have similar goals. Instead of protecting planes, you want to protect data, keep it on your server safe from attackers and limit system access to authorized personnel.

Planning Network Security

The only secure server is one that’s turned off. Once you hit the power button and let people have access, you run the risk of compromised security, and you must trust the people who work on the servers. You should decide what activities you’re trying to prevent. Are lives at stake if medical data is altered? Is there a risk of privacy violations and financial harm to your customers if Social Security numbers or credit-card numbers are disclosed? Are trade secrets and confidential business plans at risk if someone has access to sensitive information?

As you think about securing your machines, think about network security, physical security and user security. I’m attempting to get you to think about what you’re doing right, wrong and what you might need to change. This isn’t an all-inclusive list—threats change and evolve, and specifics change—but the basic concepts remain the same.

See the Redbooks* publication, “Understanding IT Perimeter Security” for more information on this topic (www.redbooks.ibm.com/redpapers/abstracts/redp4397.html?Open).

We usually put our machines behind firewalls. In some environments, firewalls aren’t enough, and absolutely no network activity is allowed to the public Internet. Some companies choose to implement different network layers and segregate which machines access which networks. In some environments network traffic isn’t allowed to leave the computer room—you have to use a secure terminal in a secure area to access data. Again, you must weigh what you’re trying to accomplish, from whom you’re trying to protect yourself and what harm will be done if the data you’re trying to secure is compromised.

I once heard about a phone call from a customer to a service provider who was hosting his servers. The customer asked the provider how secure his physical servers were and was told that they were in a raised-floor environment that required a keycard to access. The customer replied that he was standing in front of the servers on the raised floor and he wasn’t happy. Dressed in his normal clothes—not as a maintenance man pretending to do work on the air conditioning unit—the customer gained entry when a friendly, helpful authorized person held the door to the raised floor open for him.

This event caused mantraps to be installed. To access the raised floor, you had to scan your fingerprint and your keycard, and then enter the mantrap one employee at a time. It caused more pain when many people needed to access the raised floor at once, but it was determined that this pain was offset by the gains in knowing exactly who was on the raised floor, and removing the possibility of someone letting any unauthorized people onto the floor. If an attacker gains physical access to the machine, then it’s game over; physically securing the machines is critical.

Tightening Security

Many data centers have locked cages around the servers. In the aforementioned scenario, even if a helpful employee helped you access the raised floor, you’d still need keys to the cages to work on the machines.

This isn’t limited to a raised-floor environment. Once I have access to a desktop machine I can add a small connector between the keyboard and the machine to log keystrokes and capture passwords. I can copy data onto a USB thumb drive. I can boot the OS into maintenance mode and make changes to allow me to access the machine in the future, or change the root password. All I need is a part-time job with a cleaning crew and many machines are vulnerable to attack.

Many people still use Telnet and FTP to access their machines. Both of these programs send their traffic unencrypted over the network. If I trace the network traffic on my machine I can easily capture cleartext passwords. I’d make it a priority to convert to SSH/SCP/SFTP so that the network traffic was encrypted.

SSH has its own problems. Many people like to set up public/private keys and allow themselves to access their servers without passwords. It can be convenient to use one master workstation to connect to all of the machines. By setting up public/private keys you may easily create wonderful tools that allow you to make changes across your environment instead of logging in to each machine individually. If you choose to do this, be sure to protect your private keys. If I steal your keys, I can log on as you. It would be better to create a passphrase and then use SSH-agent instead of having no passphrase at all. Again, you have to weigh the risks versus the benefits. See “References,” below.
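If you do go the key route, a rough sketch of the passphrase-plus-agent approach looks like this (the key type, file names and hostname are only examples, and the details depend on your OpenSSH version):

# generate a key pair protected by a passphrase
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa

# load the key into an agent once per session so you type the passphrase once
eval $(ssh-agent)
ssh-add ~/.ssh/id_rsa

# copy the public key to a server you manage
ssh-copy-id admin@server1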

On the public Internet, the number of attacks logged against port 22 has been rising. If your SSHD is listening on the public Internet, it might be worth changing the port it’s listening on. This will keep some automated scripts from attacking you, since you won’t be listening to the port that they expect you to, but this is also offset by the pain of notifying everyone that needs to use this newly changed port. This won’t help if the attacker port scans you and finds the newly assigned port, but may help defend against automated tools and unsophisticated attackers.
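As a hedged sketch of moving sshd off port 22 (the port number is arbitrary, and how you restart sshd depends on your platform; on AIX it typically runs under the SRC):

# in /etc/ssh/sshd_config, change the listening directive, for example:
#   Port 2222

# restart sshd to pick up the change (AIX SRC syntax shown)
stopsrc -s sshd
startsrc -s sshd

# clients then connect with the new port
ssh -p 2222 admin@server1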

Final Tips

Run netstat -an | more on your machine and look at all the ports that are listening. Do you know what every program is? If not, find out what starts that process and turn it off. Check /etc/inetd.conf, /etc/inittab, /etc/rc.tcpip, etc., and turn off unneeded services. An attacker can’t connect to a service that isn’t listening.
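As an illustrative sketch (which services you can safely disable depends entirely on your environment):

# list listening ports
netstat -an | grep -i listen

# comment out unneeded services such as telnet or ftp in /etc/inetd.conf,
# then tell inetd to reread its configuration (AIX SRC syntax)
refresh -s inetd

# review /etc/rc.tcpip and /etc/inittab for anything else started at boot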

Verify your path is set correctly and think before you type. I’ve heard of attackers who would change root’s path to have “.” at the beginning, which causes the shell to execute whatever is in the current directory first. Then all they had to do was drop a script or program into some directory and wait for root to run it. Depending on the administrator’s skill level, he might not even realize he just gave away root access to the machine.
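A quick way to check for the problem (just an illustration):

# list each PATH entry on its own line and flag "." or empty entries,
# either of which makes the shell search the current directory first
echo $PATH | tr ':' '\n' | grep -n -e '^\.$' -e '^$'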

Secure, Workable System

Monitoring logs, hardening machines and continually maintaining a secure posture can be painful for both system administrators and for users. You want to keep the data relatively easy to access for those whose jobs demand it, while keeping others out. In both scenarios, the goal is a secure workable system, either in the air or on the raised floor.

References

www.lbl.gov/cyber/systems/ssh.html
www.securityfocus.com/infocus/1806
www.cs.uchicago.edu/info/services/ssh

Analyzing Live Partition Mobility

Edit: This is taken for granted now. Some links no longer work.

IBM’s Live Partition Mobility moves running workloads from one machine to another. I saw it in action and went from skeptic to believer in a matter of minutes.

Originally posted November 2007 by IBM Systems Magazine

I was in the Executive Briefing Center in Austin, Texas, recently for a technical briefing. It’s a beautiful facility, and if you can justify the time away from the office, I highly recommend scheduling some time with them in order to learn more about the latest offerings from IBM. From their Web site:

“The IBM Executive Briefing Center in Austin, Texas, is a showcase for IBM System p server hardware and software offerings. Our main mission is to assist IBM customers and their marketing teams in learning about new IBM System p and IBM System Storage products and services. We provide tailored customer briefings and specialized marketing events.

“Customers from all over the world come to the Austin IBM Executive Briefing Center for the latest information on the IBM UNIX-based offerings. Here they can learn about the latest developments on the IBM System p and AIX 5L, the role of Linux and how to take advantage of the strengths of our various UNIX-capable IBM systems as they deploy mission-critical applications. Companies interested in On Demand Business capabilities also find IBM System p offers some of the most advanced self-management features for UNIX servers on the market today.”

While I was in Austin, one of the things that IBM demonstrated was how you can move workloads from one machine to another. IBM calls this Live Partition Mobility.

I saw it in action and went from skeptic to believer in a matter of minutes. At the beginning, I kept saying things like, “This whole operation will take forever.” “The end users are going to see a disruption.” “There has to be some pain involved with this solution.” Then they ran the demo.

The presenters had two POWER6 System p 570 machines connected to the hardware-management console (HMC). They started a tool that simulated a workload on one of the machines. They kicked off the partition-mobility process. It was fast, and it was seamless. The workload moved from the source frame to the target frame. Then they showed how they could move it from the target frame back to the original source frame. They said they could move that partition back and forth all day long. (Ask your business partner or IBM sales representative to see a copy of the demo. A Flash-based version was recorded to show customers. I’m still waiting for it to show up on YouTube.)

The only pain that I can see with this solution is that the entire partition that you want to move must be virtualized. You must use a virtual I/O (VIO) server and boot your partition from shared disk that’s presented by that VIO server, typically a storage-area network (SAN) logical unit number (LUN). You must use a shared Ethernet adapter. All of your storage must be virtualized and shared between the VIO servers. Both machines must be on the same subnet and share the same HMC. You also must be running on the new POWER6 hardware with a supported OS.

Once you get everything set up, and hit the button to move the partition, it all goes pretty quickly. Since it’s going to move a ton of data over the network (it has to copy a running partition from one frame to another), they suggest that you be running on Gigabit Ethernet and not 100 Megabit Ethernet.
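If you prefer the HMC command line to the GUI, the same operation is exposed there; here’s a hedged sketch, with the managed system and partition names as placeholders (options vary by HMC release):

# validate that the partition can move before actually trying it
migrlpar -o v -m source-570 -t target-570 -p mylpar

# perform the live migration
migrlpar -o m -m source-570 -t target-570 -p mylpar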

I can think of a few scenarios where this capability would be useful:

The next time errpt shows me a sysplanar error, I call support and they confirm that we have to replace a part (which usually requires a system power down). I just schedule the CE to come do the work during the day. Assuming I have my virtualization in place and a suitable machine to move my workload to, I move my partition over to the other hardware while the repair is carried out. No calling around the business asking for maintenance windows. No doing repairs at 1 a.m. on a Sunday. We can now do the work whenever we want, as the business will see no disruption at all.

Maybe I can run my workload just fine for most of the time on a smaller machine, but at certain times (i.e., month end), I’d rather run the application on a faster processor or a beefier machine that’s sitting in the computer room. Move the partition over to finish running a large month-end job, then move it back when the processing completes.

Maybe it’s time to upgrade your hardware. Bring in your new machine, set up your VIO server, move the partition to your new hardware and decommission your old hardware. Your business won’t even know what happened, but will wonder why the response time is so much better.

What happens if you’re trying to move a partition and your target machine blows up? If the workload hasn’t completely moved, the operation aborts and you continue running on your source machine.

This technology isn’t a substitute for High Availability Cluster Multi-Processing (HACMP) or any kind of disaster-recovery situation. This entire operation assumes both machines are up and running, and resources are available on your target machine to handle your partition’s needs. Planning will be required.

This will be a tool that I will be very happy to recommend to customers.

Tips for Gaining Practical Systems Administrator Knowledge

Edit: Winning this contest opened doors for me.

When an individual seeks additional experience, the whole company may benefit as a result.

Originally posted April 2007 by IBM Systems Magazine

Note: As part of a collaboration between PowerAIX.org and IBM Systems Magazine, guest writers were invited to submit Tips & Techniques articles to be considered for publication. A panel decided Rob McNelly’s column, seen here, best met the contest’s criteria.

I work with an intern. He goes to school and comes to the office when he’s not studying for tests, working on homework or going to class. It’s fair to say that we subject him to some good-natured abuse. For example, we send him to the computer room to look up serial numbers or to verify that network cables are plugged in. When I ask him why he puts up with it, he tells me he’s grateful for the opportunity and will happily do anything in order to gain experience.

How else do entry-level people get their start in the industry? When I look at job listings I see plenty of opportunities for senior-level administrators with years of experience. I don’t see the same opportunities for novices that I once did. There seem to be fewer openings for people to start out on a help desk or in operations and then move their way up. It still happens, but many of those lower-skilled jobs are now being handled remotely from overseas.

Seek Practical Experience

Besides working as a paid intern, another method I’ve seen people use to gain practical experience is to get some older hardware from eBay and use that as their test lab. You don’t need the latest System p5* 595 server to learn how to get around the AIX* OS. An old machine might be slower, but it’s just fine to practice loading patches, getting used to working with the logical volume manager and learning the differences between AIX and other flavors of UNIX*. Mixing older RS/6000* machines with some older PCs running Linux* can give anyone a good understanding and exposure to UNIX without actually learning it on the job or spending a great deal of money.

People can also download Redbooks from IBM, and use those study guides to learn more. If they then get involved with a local user group, they can make connections with people who are usually willing to share their knowledge. Eventually, they have some basic knowledge, and can hope to land a position as a junior-level administrator.

This initiative to learn outside of work hours can prove invaluable. I know that if I interview someone who tells me he has little hands-on experience working in a large datacenter, but he’s shown that he’s ambitious enough to study and learn what he can on his own, I’m willing to take the chance that I can teach him the finer points of what he needs to know to do the job. Give me someone with a good attitude and a desire to learn, and he or she can usually be taught what’s necessary to be productive in my environment.

Senior-level administrators can give back by writing articles and answering questions. Personally, I’ve found some irc channels and some usenet groups that I respond to if I have time. If we want more people to learn about the benefits of using the AIX OS, then we should be willing to help them when they run into problems. Many people run Linux because it’s relatively easy to obtain and install. They know that they can go online and easily get help when they run into problems. That same type of community should be encouraged around the AIX world as well. The following are some tips that I’ve found helpful. Hopefully our intern finds them helpful as well.

Migrating Machines

When migrating machines to new hardware in the good old days, I would make a mksysb tape and take that tape over to the new server I was building. I would boot from the AIX CD, select that mksysb image and restore it to my new machine. As time goes on, I find it less common to see newer hardware equipped with tape drives. Much of my server cloning these days occurs using the network. Two tools that I rely on are NIM and Storix. I create my mksysb and move it to my NIM server or use the Storix Network Administrator GUI and create a backup image of my machine to my backup server. In either case, I just boot the machine that I want to overwrite, set up the correct network settings and install the image over the network. This can be a problem in a disaster-recovery situation if you haven’t made sure that these backup images are available offsite, but for day-to-day system imaging I’ve found both methods to be useful.
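As a rough sketch of the NIM side of that workflow (the resource and host names are invented for the example, and your NIM environment will differ):

# on the client: create a mksysb image, then copy it to the NIM master
mksysb -i /tmp/aixhost1.mksysb

# on the NIM master: define that image as a mksysb resource
nim -o define -t mksysb -a server=master -a location=/export/mksysb/aixhost1.mksysb aixhost1_mksysb

# install it onto the target machine using an existing SPOT
nim -o bos_inst -a source=mksysb -a mksysb=aixhost1_mksysb -a spot=spot_53 aixhost2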

Sorting Slots

I know that some people have issues when looking at the back of a p595 server. It can be a chore when you want to know which slot is which. This can be important when creating several LPARs on the machine. You want to keep track of which Fibre card and which network card goes with which LPAR. Anything looks complicated until someone shows you how it works.

First, find the serial number of the drawers on the machine, as this is the information that’s displayed on the hardware-management console (HMC) and we’re trying to correlate the physical slot to what’s displayed on the HMC. I used a flashlight and looked at the left side of the front of my example machine. It has two drawers, in this case 9920C6A and 9920C7V.

When you go to the back of the machine, start counting your I/O cards from left to right. There will be four cards, a card you ignore, then six cards. These will be your first 10 slots on the drawer’s left side. There are four more cards, a card you ignore, and six more cards, making up the 10 slots on the drawer’s right side.

These slot numbers correspond with the slots you see when you select required and desired I/O components from the HMC. This I/O drawer had the following selections that I could choose from on the HMC (P1 is the left side of the top drawer, or the first 10 slots. P2 is the right side of the top drawer, or the second 10 slots.):

  • 9920C6A-P1
  • 9920C6A-P2

When I looked at the drawer, going from left to right, I wrote down:

  • C01 is an Ethernet card
  • C02 is a Fibre card
  • C03 and C04 are empty
  • C05 is a Fibre card
  • C06 is an Ethernet card
  • C07 is empty
  • C08 is a Fibre card
  • C09 is empty
  • C10 is a SCSI bus controller

So, I assigned C01, C02 and C05 from 9920C6A-P1 to this LPAR. If I continue the exercise and go to the right side of the top drawer, I start over with C01 and note which type of cards were in which slot. I then continue to do the same thing on the bottom drawer. In this way, I know exactly which cards are in which slot, and it’s simple to assign them to the particular LPAR in which you want them. For redundancy, I’ve heard recommendations state that you take one Fibre card from your top drawer and another Fibre card from your bottom drawer. This way you will still have a path to the SAN if you were to lose one of the drawers.

A Fresh Perspective

Another thing I like to do when I bring in new employees is to have them look for what we’re doing wrong. The new employee has fresh eyes. They don’t know that “this is the way we always do things around here.” They see a piece of documentation, a tool or a process, and can question why things are done the way they are.

In some cases, there are perfectly good reasons why things are being done a certain way and you can explain them. In other cases, there’s no good reason, other than it’s the way things have always been done. Instead of trying to get them up to speed and make them do it the company way, let them ask you to defend why the company does things this way in the first place. Maybe their previous employer had a much better method that they used to get things done. This is a great time to learn from each other to improve the environment.

We can try to help make a difference in a newer AIX administrator’s career. However, that doesn’t mean we’re the fountain of all information. I’ve found a time or two when an intern has asked why we do things a certain way, and I didn’t have a good answer. I told him to figure out a better way, and come back and inform the group. This helps him with his knowledge of where to look for information and it has helped us all think about processes and procedures that we’ve taken for granted.

The intern’s learning is an example of the difference a little practical knowledge can make. When an individual seeks additional experience, the whole company may benefit as a result.

Establishing Good Server Build Standards, Continued

Edit: Still useful information.

Standards and checklists can take effort to maintain but, once in place, all of your builds look identical.

Originally posted January 2007 by IBM Systems Magazine

Note: This is the second of a two-part article series. The first part appeared in the December 2006 EXTRA.

In my first part of this article series, I explained the importance of establishing good server build standards, along with a mechanism to enforce those standards. I also explained the importance of putting in place a checklist to ensure the standards are met in a consistent manner. This second article installment looks further into server build standards.

The Benefits of a Good Server Build

Standards and checklists can take a great deal of effort to maintain but, once in place, all of your builds look identical. The actual time it takes to deploy a server is minimized. Administrators are then free to work on other production issues instead of spending a great deal of time loading machines. When you’re only deploying a server once in a while, this might not be a big deal. But when you start to deploy hundreds of machines, your application support teams are going to appreciate your consistency. If two administrators are building machines – and each machine looks slightly different – your end users won’t know exactly what to expect when a machine gets turned over to them. They spend time asking for corrections and additions to the machine that should have been completed when the server was loaded. This makes your team’s work product look shoddy, as you’re not consistently delivering the same end product.

People who use the machines should come to expect that new server builds will have the same tools and settings as all the other machines in the environment. When the database team is ready to load its software and finds file systems or userids missing, it makes the job more difficult, as the team isn’t sure what to expect when it first gets access to the machine.

Be Consistent

Besides the server builds, the actual carving up of LPARs should be consistent. Sure, some machines might be using “capacity on demand,” and some might want to run capped or uncapped, but when these decisions are made, document them so people know what to expect in the different profiles. If you explain why you chose the setting, people are less likely to change it. Likewise, if you tell them why you chose shared processors and why the minimum and maximum number of processors look the way they do, they’ll be less likely to mess with it.

So how do you get people to actually follow the standards and documentation you’ve been maintaining? Make sure it’s easy to follow. The new person who joins your team should be able to quickly get up to speed on what you’re doing and why. This will make them a more effective member of the team in less time. When people make mistakes, or blatantly ignore the standards, call them out on it; maybe privately at first, but if it continues, I think the whole team should be made aware that a problem exists. Maybe there’s a good reason the standards aren’t being followed. Maybe they’ve been to class and learned something new. If this is the case, there should be some discussion and consensus as to how the documentation and standards should change.

The documentation your team maintains should obviously focus on more than just server builds. The more quality procedures and documentation you can create, the easier your job is going to be over the long term. If you have a well written procedure, you can easily remind yourself of what you did six months ago, and which files you need to change and which commands you need to run to make changes to the system today.

Some members of the team have stronger documentation skills than others. Some members of the team may have very strong technical skills, but their writing skills may not be as effective. This shouldn’t automatically get people off the hook, but if they really don’t have a good grasp on the language, or just have problems getting documentation onto paper, maybe they need to work together with someone who has more skill in that area. Maybe there needs to be a dedicated resource that works on creating and maintaining the documentation. Obviously every team will be different. The key to making effective use of documentation is to make it easily available (especially when on call or working remotely) and making it easily searchable.

When you are able to quickly and easily search for documentation, and everyone knows exactly where it is, it is more apt to be used. Instead of reinventing the wheel, people should be able to quickly find the material they need to do their jobs. In some cases, a very brief listing of necessary commands may be very helpful in troubleshooting a problem. It’s also helpful to have a good overview of common problems, how things should behave normally, and where to go for further information if they’re still having problems.

Once the documentation and the golden image are in place, your team can start looking for other ways to automate and enhance the environment they work in. There are always better ways to do things. Just because something is the way things are done today doesn’t mean it’s the best way to get things done. With an open mind, and a fresh set of eyes, sometimes we can more easily see the things around us that could use improvement. Then it’s just a question of making the time to make things happen. Sometimes it requires small steps, but with a clear vision of how things should look, we can make the necessary adjustments to make things better.

Establishing Good Server Build Standards

Edit: This can be less of an issue when things are more automated, but it is still worth consideration.

Server build standards simplify the process of supporting IT environments.

Originally posted December 2006 by IBM Systems Magazine

Note: This is the first of a two-part article series. The second part will appear in the January EXTRA.

There are still small organizations with one or two full time IT professionals working for them. They may find they are able to make things work with a minimum of documentation or procedures. Their environment may be small enough that they can keep it all in their heads with no real need for formal documentation or procedures. As they continue to grow, however, they may find that formal processes will help them, as well as the additional staff that they bring on board. Eventually, they may grow to a point where this documentation is a must.

The other day I was shutting down a logical partition (LPAR) that a co-worker had created on a POWER5 machine. A member of the application support team had requested we shut down the LPAR as some changes had been made, and they wanted to verify everything would come up cleanly and automatically after a reboot. We decided to take advantage of the outage and change a setting in the profile and restart it. To our surprise, after the LPAR finished its shutdown, the whole frame powered off. When you go into the HMC and right-click on the system, and select properties, you see the managed system property dialog. On the general tab, there’s a check-box that tells the machine to power off the frame after all logical partitions are powered off. During the initial build, this setting was selected, and our quick reboot turned into a much longer affair as the whole frame had to power back up before we were able to activate our partition. This profile setting had not been communicated to anyone, and we had mistakenly assumed it was set up like the other machines in our environment.

This scenario could have been avoided had there been good server build standards in place, along with a mechanism to enforce those standards. Our problem wasn’t that the option was selected, but that there was no good documentation in place that specified exactly what each setting should look like and why. Someone saw a setting and made their best guess as to what that setting should be, and then that decision was not communicated to the rest of the team. One of the problems with having a large team is people can make decisions like these without letting others know what has taken place. Unless they have told other people what they’re doing, other members of the team might assume the machine will behave one way when, in actuality, it’s been set up another way.

Making A List, and Checking It Twice

Checklists and documentation are great, as long as people are actually doing what they are supposed to. Some shops have a senior administrator write the checklist, and a junior administrator build the machine, while another verifies the build was done correctly. A problem can crop up when a senior administrator asks for something in the checklist without explaining the thinking behind it. He understands why he has asked for some setting to be made, or some step to be taken, but nobody else knows why it’s there. The documentation should include what needs to change, but also why it needs to be changed. If it’s clear why changes are made, people are more apt to actually follow through and make sure all the servers are consistent throughout the environment. If the answer they get is to just do it, they might be less likely to bother with it since they don’t understand it anyway. The person actually building the machine might not think it’s important to follow through on, which leads to the team thinking a server is being built one way, when the finished product does not actually look the way the team as a whole thought it would.

The team also needs to be sure to keep on top of the checklist, as this is a living document that will be in a constant state of flux. As time goes on, if that checklist is not kept up to date, things can change with the operating system and maintenance level patches that either make that setting obsolete, or the setting starts causing problems instead of fixing them. The decision could have been made to deploy new tools, or change where logfiles go, or standard jobs that run out of cron. If these changes are not continually added to the checklist, the new server builds no longer sync with those in production. This is equally important when decommissioning a machine. There are steps that must be taken, and other groups that need to be notified. The network team might need to reclaim network cables and ports. The SAN team may need to reclaim disk and fiber cables. The facilities team may need to know this power is no longer required on the raised floor. To put it simply: A checklist that’s followed can ensure these steps get completed. Some smaller shops may not have dedicated teams to do these things, in which case it might just be a case of reminding the administrators they need to take care of these steps.

Another issue can crop up when the verifier is catching problems with the new server builds, but isn’t updating the documentation to help clarify settings that need to be made. If the verifier is consistently seeing people forgetting to change a setting, they should communicate what’s happening to the whole team, why it needs to happen during the server build, and then update the documentation to more clearly explain what needs to be done during the initial server build. What’s the point of a verifier catching problems all the time, but then not making sure the documentation is updated to avoid these problems from cropping up in the future?

Having these standards makes supporting the machines much easier, as all of the machines look the same. Troubleshooting a standard build is much easier, as you know what filesystems to expect, how volume groups are set up, where the logs should be, what /etc/tunables/nextboot looks like, and so on. Building them becomes very easy, especially with the help of a golden image. I think it’s essential to have infrastructure hardware you can use to test your standard image. This hardware can be dedicated to the infrastructure or an LPAR on a frame but, in either case, you set up your standard image to look exactly as you want all of your new servers to look, and make a mksysb of it. Then use that on your NIM server to do your standard loads. Instead of building from CD, or doing a partial NIM load with manual tasks to be done after the load, keep your golden image up to date and use that instead. Keep the manual tasks that need to happen after the server build to an absolute minimum, which will keep the inconsistencies to a minimum as well. When patches come out, or new tools need to be added to your toolbox, make sure – besides making that change to the production machines – you’re updating your golden image and creating a more current mksysb.
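A small sketch of keeping that golden image current on the NIM master (the resource names and paths are only examples):

# on the golden LPAR: capture a fresh image after patching, then copy it to the NIM master
mksysb -i /tmp/golden.mksysb

# on the NIM master: replace the resource so new builds pick up the current image
nim -o remove golden_mksysb
nim -o define -t mksysb -a server=master -a location=/export/mksysb/golden.mksysb golden_mksysb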

In next month’s article, I’ll further explore the benefits of establishing good server build standards and checklists.

Of Cubes, Offices and Remote Access Via VPN

Edit: I still believe this is true.

A system administrator’s take on getting the most from the work day.

Originally posted November 2006 by IBM Systems Magazine

Last month I looked at reasons why a VPN is a great idea for accessing your network when you are not in the office. This article examines issues I’ve encountered when working in a cube farm, and different methods I like to use when trying to get continuing education while training budgets continue to get squeezed.

When your cell phone goes off in the middle of the night and you find that a system is down and requires your attention, does your employer require you to get dressed and drive to your workplace to take care of the situation? In some environments, that answer is yes. For whatever reason, a VPN may not be allowed into the network and you must drive on site to resolve the issue. In other cases, you may have a hardware failure and no tools are available to remotely power machines on and off. Maybe you are having issues bringing up a console session remotely, and you have to drive on site. Generally, however, in most situations we are able to log in and resolve the issue without leaving the comfort of our homes.

Many companies encourage their employees to resolve issues from home as the response time is much quicker, and they hope the employee can quickly resolve the issue, get some sleep, and still be able to make it into the office for their regular hours during the day. However, the flexibility that these companies show during off hours often is not extended during daylight hours; the belief apparently being that an employee who they can’t see in the office must not actually be working.

I have worked in environments where you needed to be on site to mount tapes and to go to the users’ workstations to help them resolve computing issues they might be experiencing. There are also times that you need to be on a raised floor to actually access hardware, or you might be asked to attend a meeting in person. For the most part, much of the day-to-day work of a system administrator can be handled remotely.

When tasks are assigned to team members via a work queue, and when you are able to communicate with coworkers via e-mail and instant messaging (and a quick phone call to clarify things once in a while) there is no reason, in my opinion, to come on site every day. Some shops, however, want everyone to work in cubicles, and have everyone available during the same hours. They feel this will lead to more teaming and quicker responses from co-workers. What I’ve found in these situations is the opposite.

The Cacophony of the Cube Farm


It gets very noisy in a cube farm, and there is a great deal of socializing that takes place throughout the day. Some people try to solve the issue by isolating themselves with noise canceling headphones and hope that they can get some “heads down” time to work on issues. Instead of being part of the environment, they’re isolated and can’t hear what’s happening around them. People can still interrupt them by tapping them on the shoulder, but I find that it’s more efficient to contact them electronically instead of in person.

Cube farms easily lend themselves to walk-up requests from other employees who sit in the same building. Most organizations do their best to have change control and problem reporting tools to manage their environments. When coworkers try to short circuit the process and walk up to ask for a quick password reset or a failed login count reset, or to quickly take a look at something, it can cause problems.

Some people follow the process and open a ticket in the system, or they call the helpdesk. The helpdesk opens a ticket and assigns it to a work queue. The people who walk up to the cube bypass that whole process for a quick favor. It may not take very long to help them out, but it does cause issues. The person who granted the favor was interrupted and lost their concentration, and possibly stopped work on a high severity or mission critical situation.

The person who walked up also stopped what they were doing, walked over, waited to get your attention, and then waited while you worked on their problem. This prevented you from working on the problem you had already committed to getting done. There was no record in the system that this issue came up, which in some environments can lead to an under-reporting of trouble tickets, which can cause management to believe that fewer requests are being fulfilled than are actually occurring. When you ask them to go back and fill out a form or call the helpdesk, they can get upset that you did not immediately help them out. If you ask them to open a ticket after the fact, that becomes a hassle for them to take care of, and they have no real motivation to go back and take care of the paperwork, as their request has already been handled for them.

What I’ve found works better for me is to work remotely during the day. The interrupts still come in via instant messaging or e-mail, but I can control when I respond to them. During an event that requires immediate assistance, I can easily be paged or called on my cell phone. Just because an e-mail or an instant message comes in, that doesn’t mean I have to immediately stop what I’m working on in order to handle it. I can finish the task I’m working on, and when I reach a good stopping point, I can find out what the new request is. Depending on the severity of the request, and how long it will take, I can then prioritize when it will need my attention.

I also find that since my coworkers are not standing there waiting for me to respond, there is less time wasted by both parties. They send me an e-mail or instant message, and go on doing other things while waiting for me to respond. If it’s appropriate, I have them open a ticket and get it assigned to the correct team to work on it. For some reason, the request to have them open a ticket has been met with less hostility when I have done it over instant messaging versus a face-to-face discussion.

Offices Versus Cubicles


My next favorite place to work, if I must be onsite, is an actual office with a door that I can shut. Many companies have gravitated away from this arrangement due to the costs involved, but I think it bears some reconsideration. The noise levels in a shared office environment end up irritating a good portion of the employees. Office mates that use the phone can be heard up and down the row. Some employees want less light, some want more. Some want less noise, some want to listen to the radio and shout over the cubicle partitions to get their neighbor’s attention. All the background noise and the phone conversations make it very difficult to concentrate when working on problems.

There can be advantages to a shared work environment. When you overhear an issue that a coworker is working on, for example, you may be able to offer some help. Other times, it can be conducive to a quick off the cuff meeting with people. You can quickly look around and determine if someone is in the office that day. Some people thrive in a noisy environment, and it often all comes down to personal style and how people work best. I think many companies would be well served to offer options to their employees.

In discussing this topic with coworkers, there are some who would refuse to work from home, since they may not feel disciplined enough to get work done in that environment and would miss the interpersonal interaction they currently enjoy. I’ve heard some say they would feel cooped up in an office and need the stimulation that comes from having their coworkers around. But, for some, the ability to work remotely or to work in an actual office makes for a happier and far more productive employee.

Work environments have gotten so bad at times that I’ve seen companies set up folding tables with a power outlet and a network switch and ask people to work in that space. I suppose for a quick ad hoc project, or a disaster-recovery event, this may make some sense, but to ask people to work this way day in and day out seems almost inhumane. At least with a cubicle you have some semblance of walls, but in this arrangement employees are sitting shoulder to shoulder, and I honestly have no idea how they can even begin to think about getting things done.

Flexible Hours


Along with the ability to work remotely, I also enjoy the ability to work flexible hours. If you are working on projects, does it really matter what time of the day you work on them? I have enjoyed the flexibility of working in the morning, taking my kids to school, working more after that, taking a break around lunch time and going to the gym or out for a bike ride, then working more after that. I’ve found that I actually worked longer hours, but I didn’t mind since I was setting my own schedule and determining what time of day was most appropriate to work on the tasks at hand. Some people work better later in the evening, so why not let them work then?

Why be expected to work from 9 a.m. to 5 p.m. when 6 a.m. to 9 p.m. may work better for workers, with some breaks during the day to attend to personal matters? Some managers insist they can’t effectively supervise their employees if they don’t constantly have them around to monitor. I say this is nonsense; you can very easily tell if your employees are doing their job based on the feedback you get from people who are asking them to do work. Are they closing problem tickets? Are they finishing up the projects assigned to them? Are they attending their meetings and conference calls? Are they responsive to e-mail? If so, who cares what time of day or location the employee happened to be working from?

Training Time


Another difficult thing to do in a noisy environment is simply read and concentrate. With training budgets getting cut, many employees find that, to keep their skills current, they must constantly read and try things on their own in test environments. IBM Redbooks and other online documentation may be all the exposure that people get with topics like virtualization or HACMP or VIO servers. Having a quiet place to study, while having access to a test machine, can do wonders as far as training goes, although it doesn’t offer the same depth you can get when you are able to go to a weeklong instructor-led class. But, it’s usually better than computer-based training (CBT), in my opinion.

Hands-on lab-based training should be built into the job. The opportunities should be made available to those who want to keep their skills current, even if the training budget isn’t there. Companies should make sure employees are given the time to study these materials, even if the funding isn’t available to allow them to go to formal classes.

Many companies have told me they have given me an unlimited license to use all of the CBT courses I could take, at a huge cost savings to the company. When I looked at the course catalog, it was definitely a case of them getting what they paid for. Many times, the classes contained older material, and it was usually at an inappropriate skill level. At least with Redbooks and a test machine, you can quickly find out if you are able to get the machine to do what you think it should.

Employee Retention is Key


They say the cost of employee turnover can be significant. Instead of spending all the money to recruit and train someone new, I am always amazed that a company is not more interested in retaining the talent that they already have. The environment where people spend many of their waking hours will have an impact on whether companies are able to recruit new talent, and retain the talent they already have.

By taking steps to make the work environment less distracting, companies will likely realize a more productive workforce. If this means providing employees with their own office, then it will be money well spent. If this means letting them work remotely, that will also be a good solution. Be sure to encourage them to schedule the time in their day to read and study and try things out in a lab setting. As they gain more skill and have a quiet environment to work in, the company will find an energized and motivated pool of talent to call upon to implement their next project.

Advice for the Lazy Administrator

Edit: Still good stuff.

Originally posted September 2006 by IBM Systems Magazine

I always liked the saying that “a lazy computer operator is a good computer operator.” Many operators are always looking for ways to practically automate themselves out of a job. For them, the reasoning goes: “why should we be manually doing things, if the machine can do them instead?”

A few hours spent writing a script or tool can pay for itself very quickly by freeing up the operator’s time to perform other tasks. If you set up your script and crontab entry correctly, you can let your machine remember to take care of the mundane tasks that come up while you focus on more important things, with no more forgetting to run an important report or job. Sadly, even the best operator with the most amazing scripts and training will need help sometimes, at which point it’s time for the page out.

In our jobs as system administrators, we know we’re going to get called out during off hours to work on issues. File systems fill up, other support teams forget their passwords or lock themselves out of their accounts at 2 a.m., hardware breaks, applications crash. As much as we would love to see a lights out data center where no humans ever touch machines that take care of themselves, the reality is that someone needs to be able to fix things when they go wrong.

We hate the late night calls, but we cope with them the best we can. Hopefully management appreciates the fact that many of us have families and lives outside of work. We are not machines, or part of the data center. We can’t be expected to function all day at work, then all night after getting called out. It’s difficult to get back to sleep after getting called out, and it impacts our performance on the job the day after we are called or, worse, it ruins our weekends. However, our expertise and knowledge are required to keep the business running smoothly with a minimum of outages, which is all factored into our salaries.

I have seen different methods used, but the approach is basically the same. Each person on the team gets assigned a week at a time, with some jockeying around to schedule on-call weeks to avoid holidays, and usually people can work it all out at the team level. In one example, I even saw cash exchange hands to ensure that one individual was able to skip his week. Whatever method is used, the next question revolves around how long you’re on call. Is it 5 p.m. – 8 a.m. M-F and all day Saturday and Sunday? Is it 24 x 7 Monday through Monday? Does the pager or cell phone get handed off on a Wednesday? Do we use individual cell phones or a team cell phone? These are all answers to the same question, and you have to find the right balance between the number of calls you deal with off-shift and the on-call workload during the day.

On call rotation is the bane of our existence, but we can take steps to reduce the frequency of the late night wake up calls. If we have stable machines with good monitoring tools and scripts in place, that can go a long way towards eliminating unnecessary callouts. Having a well-trained first-level support, operations, or help desk staff can also help eliminate call outs.

In a perfect world, a monitoring tool like NetView or OpenView or Netcool is in place monitoring the servers, where all of the configurations are up to date and all of the critical processes and filesystems are being monitored. When something goes bad, operations sees the alert, and they have good documentation, procedures and training in place to do some troubleshooting. Hopefully they’ve been on the job for a while and know what things are normal in this environment, and they can quickly identify when there is a problem. For routine problems, you have given them the necessary authority (via sudo) or written scripts for them to use to reset a password, reset a user’s failed login count, or even add space to a filesystem if necessary.
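As a hedged sketch of what that delegation might look like in /etc/sudoers (the group name, paths and wildcards are examples only; in real life you’d lock the arguments down much further, and AIX command locations can vary):

# let the operations group reset a failed login count and grow a filesystem
%operations ALL = (root) /usr/bin/chsec -f /etc/security/lastlog -a unsuccessful_login_count=0 -s *
%operations ALL = (root) /usr/sbin/chfs -a size=+512M *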

I spent time in operations early in my career, and learned a great deal from that opportunity. I remember it was a great stepping stone: many of my coworkers got their start working 2nd and 3rd shift in operations positions. This was a great training ground, but all of the good operators were quickly “stolen” to come work in 2nd and 3rd level support areas.

If another support team needs to get involved, operations pages them and manages the call. Then the inevitable happens: someone needs to run something as root, or they need our help looking at topas or nmon, etc. Hopefully they were granted sudo access to start and stop their applications, but sometimes things just are not working right, and that’s when they page the system administrator. Ideally, by the time we’ve been paged, first level support has done a good job with initial problem determination, the correct support team has been engaged, and by the time they get to us, they know what they need for us to do and it will be a quick call and we can go back to sleep.

Sometimes it’s not a quick call: nobody knows what’s wrong, and they’re looking to us to help them determine whether anything is wrong with the base operating system. In a previous job, I used a tool that kept a baseline snapshot of what the system should look like normally. It knew what filesystems should be mounted, what the network looked like and which applications were running, and it saved that information to a file. When run on the system in its abnormal state, it was easy to see what was not running, which made finding the problem very simple. Sometimes, however, this did not find anything either, which is where having a good record of all the calls worked by the on-call team is a godsend. A quick search for the hostname would bring back hits that could give a clue as to problems others on your team had encountered, and what they had done to solve them.
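A minimal sketch of that baseline idea (the file locations and the set of commands captured are just examples):

#!/bin/ksh
# capture a "known good" snapshot of the system for later comparison
BASE=/var/adm/baseline
mkdir -p $BASE
df -k       > $BASE/filesystems.txt   # mounted filesystems and usage
netstat -rn > $BASE/routes.txt        # routing table
lsps -a     > $BASE/paging.txt        # paging space
ps -ef      > $BASE/processes.txt     # running processes
# during an incident, diff the current state against the baseline, for example:
#   df -k | diff $BASE/filesystems.txt -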

At some point, the problem will be solved, everyone will say it’s running fine, and everyone will hang up from the phone call (or instant messaging chat, depending on the situation) and go to bed. Hopefully, as the call was ongoing, you were keeping good notes and updating your on call database with the information that will be helpful to others to solve the problem in the future. Just typing in “fixed it” in the on call record will not help the next guy who gets called on this issue nine months down the road.

Hopefully you are having team meetings, and in these meetings you are going over the problems your team has faced during the last week of being on call, and the solutions that you used to solve them. There should be some discussion as to whether this is the best way to solve it in the future, and whether any follow-up needs to happen. Do you need to write some tools, or expand some space in a file system, or to educate some users or operations staff? Perhaps you need to give people more sudo access so they can do their jobs without bothering the system admin team.

Over time, the process can become so ingrained that the calls decrease to a very manageable level. Everyone will be happier, the users will have machines that don’t go down, and if/when they do, operations is the first team to know about it. The machines can be proactively managed which will save the company from unnecessary downtime.

The Benefits of Working Remotely Via VPN

Edit: Hopefully this problem has been solved by now.

Originally posted October 2006 by IBM Systems Magazine

It’s 2 a.m., and you’ve just been paged. Do you have an easy way to get into your network, or is the pain of waking up going to be compounded by frustrations associated with dialing into work? In the good old days, I can remember dialing into work with a modem in order to get work done. It was slow, but there weren’t any alternatives. I just thought I was lucky I could avoid the drive back onsite to fix something in the middle of the night.

Sometimes I would use a package like Symantec’s pcAnywhere to remotely control a PC that had been left powered on in the office. We would use this same type of solution for our road warriors, who would dial in from a hotel room and do their best to get their e-mail or reports from the server. It wasn’t ideal, but it was one of the best solutions available at the time. Some employers still use solutions like pcAnywhere, gotomypc.com, Citrix, etc. These approaches can be useful for non-technical users, or for people that need to use desktops that are locked down. However, with the advent of the ability to tunnel over a virtual private network (VPN) into the corporate network, the need to use remote control software should lessen, especially for the technical support staff members who happen to be remote.

The need to be remote might not even be related to a call out in the middle of the night. You might have employees who travel and need to access the network from a cab, airport or hotel. You may be interested in offering the ability for your employees to work remotely and require them to be in the office less often. You may have an employee who is too sick to come into the office, but not so sick that they cannot take some Dayquil and do some work from home. You may have an employee with a sick child who is unable to go to daycare. Instead of asking them to take a sick day to care for their child, hopefully you have the tools and policies in place to allow them to work remotely while their child is resting. All of these situations end up being productivity gains for the employer. Instead of idle time during which an employee is unable to connect to the office and get work done, a simple VPN connection into the office gives the employee the opportunity to get things done from wherever they are, using the tools they’re accustomed to.

I have known customers that outfit their employees with laptops that allow them to work from home, but then cripple them with a Citrix solution, or another remote access method that doesn’t allow them to use the tools that are on their machines. It’s much easier for the employee to use the applications that are loaded on the laptop, in the same way that they are used in the office. When you put another virtual desktop in the middle of things, it complicates life unnecessarily compared to allowing this machine to be just another node on the network.

Security Considerations and Precautions

There are security considerations and precautions that need to be taken when thinking about a VPN. Nobody wants to deploy a solution that allows their employees in, but also allows non-employees to have unauthorized access. We must do our best to mitigate these risks, while still allowing trusted people to have the resources to do their jobs. There are going to be some networks that don’t allow any traffic in or out of them from the outside, and obviously this discussion is not intended for them. There are going to be situations where sensitive information exists where the risk of disclosure outweighs any benefits of allowing remote access to anyone.

In many instances, providing employees with network access is a benefit to the employee and the employer. The time it will take to wait for an employee to get dressed and drive in (especially when they live great distances away) can be an unacceptable delay when a critical application goes down during the night. Instead of waiting for them to drive on-site, provide the right tools to get the job done remotely.

An ideal world is one where you can work seamlessly from wherever you happen to be. Cellular broadband networks, 802.11 wireless networks, and wired broadband networks in the home, coupled with a decent VPN connection, have gotten us to the point where it really doesn’t matter where an employee physically sits in order to get the work done. We can see the truth of that statement in the globalization of the technical support work force. Many organizations are taking advantage of the benefits of employees working from anywhere, including other countries. It would be ridiculous to ask an employee to work remotely from overseas over a Citrix connection that has a 15-minute inactivity timeout. It should be just as ridiculous to ask a local employee to use this type of connection to troubleshoot and resolve issues with servers.

Using What You’re Familiar With

When you need to connect to your hardware management console (HMC) from home, it’s nice to run WebSM the same way you do in your office. You could run Secure Shell (ssh) into the HMC as hmcroot, and run vtmenu. From there, you enter the correct number for the managed system you want to use, and then type the number of the LPAR you want to open a console window for. This is fine, but sometimes you need to use the GUI to do work on the profiles or to stop and start LPARs.
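For example, that console path might look something like the following (the HMC host name is a placeholder, and the user ID is the one described above; adjust both to your environment):

ssh hmcroot@myhmc      # log into the HMC over the VPN
vtmenu                 # choose the managed system, then the LPAR, to open a console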

Why not just use the tools and methods you’re familiar with and use in the office? I’ve worked both ways, and being able to suspend your laptop, go somewhere else, restart it, connect to the VPN, and pick up right where you left off by using virtual network computing (VNC) is a great way to work. If you have your instant messenger running in a VNC session, it can be so seamless that your coworkers may not even realize you’ve moved physical locations – they just notice that you didn’t respond for a while, and you don’t have to interrupt the flow of the chat session that was in progress.

Being asked to use a Citrix-like solution that is clunky by comparison (especially if the Citrix connection drops or times out too quickly) can quickly make employees less eager to take care of problems from home. Instead of quickly and easily connecting to the network and solving the problem, you have people wasting time fighting a difficult solution.

When I use a seamless VPN connection, I actually find that I work more hours. It’s so easy to get online that I constantly find myself doing work before and after my hours on-site, and even doing things on the weekends. Checking e-mail, looking at server health-check information and checking the on-call pager logs are all so easy to do, I figure why not spend a few minutes and do them. When I contrast that with a solution that’s painful to use, I see that people are not nearly as interested in getting online, and work only happens as a last resort, when something is broken and they have no other way to fix it.

VPN Options

I have used commercial VPN offerings, including the AT&T network client and IBM WebSphere Everyplace Connection Manager (WECM), and open source offerings, including OpenVPN. There are pros and cons to all of them, but the main thing they share is the capability to make your remote connection replicate the look and feel of your office environment.

One aspect of the AT&T client that I liked was the ability to fall back to dial-up access when I couldn’t find broadband, and to go over a broadband connection when I could. Obviously, the speed differential was tremendous, but the capability to dial in when there was no other way to make a connection was very helpful while traveling.

When I used a WECM gateway, I found I was able to be connected on a wireless network, suspend my laptop, go to a wired network, take my laptop out of hibernation, and have the network connections re-establish themselves over the new connection. This made things even more seamless and transparent to the end user.

As this IBM Web site explains: “IBM WebSphere Everyplace Connection Manager (WECM) Version 5.1 allows enterprises to efficiently extend existing applications to mobile workers over many different wireless and wireline networks. It allows users with different application needs to select the wireless network that best suits their situation. It also supports seamless roaming between different networks. WECM V5.1 can be used by service providers to produce highly encrypted, optimized solutions for their enterprise customers.”

“WECM V5.1 is a distributed, scalable, multipurpose communications platform designed to optimize bandwidth, help reduce costs, and help ensure security. It creates a mobile VPN that encrypts data over vulnerable wireless LAN and wireless WAN connections. It integrates an exhaustive list of standard IP and non-IP wireless bearer networks, server hardware, device operating systems, and mobile security protocols. Support for Windows Mobile V5 devices clients has now been added.”

Both of these solutions cost money, so a low-cost alternative is to set up a Linux machine as an OpenVPN server; a minimal example follows below. A full discussion is beyond the scope of this article, but more information can be found at openvpn.net. From that site’s main page: “OpenVPN is a full-featured SSL VPN solution that can accommodate a wide range of configurations, including remote access, site-to-site VPNs, WiFi security, and enterprise-scale remote access solutions with load balancing, failover, and fine-grained access-controls.”

“OpenVPN implements OSI layer 2 or 3 secure network extension using the industry standard SSL/TLS protocol, supports flexible client authentication methods based on certificates, smart cards, and/or 2-factor authentication, and allows user or group-specific access control policies using firewall rules applied to the VPN virtual interface. OpenVPN is not a Web application proxy and does not operate through a Web browser.”
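Just to give a flavor of how little it takes to get a tunnel up, here is a minimal point-to-point sketch using OpenVPN’s static-key mode; the host name and tunnel addresses are placeholders, and a real deployment would use the certificate-based setup the project documents:

# on the office gateway
openvpn --genkey --secret static.key
openvpn --dev tun1 --ifconfig 10.9.0.1 10.9.0.2 --secret static.key

# on the remote laptop, after copying static.key over a secure channel
openvpn --remote vpn.example.com --dev tun1 --ifconfig 10.9.0.2 10.9.0.1 --secret static.key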

The competition for talent in today’s IT world is fierce. During the interview process, when a potential candidate asks about the solution you use for working from home and for on-call support, hopefully you can give them the right answer. With the right infrastructure in place, it may even be possible to recruit talent and allow them to continue living where they are, instead of asking them to relocate.

Most organizations already have good solutions in place, but it never hurts to revisit the topic, and see if there is room for improvement where you work.

Real World Disaster Recovery

Edit: One of my favorite articles.

Originally posted June 2006 by IBM Systems Magazine

Disaster recovery (D/R) planning and testing has been a large part of my career. I’ve never forgotten my first computer-operations position and the manager who showed me a cartoon of two guys living on the street. One turned and said to the other, “I did a good job, but I forgot to take good backups.”

I’ve been involved in D/R exercises for a variety of customers, and I was also peripherally involved in a D/R event that happened after Hurricane Katrina.

Does your datacenter have the right procedures and equipment in place to recover your business from a disaster? Can your business survive extended downtime without your computing resources? Is your company prepared for a planned D/R event? What about an unplanned event? I’ve helped customers recover from both types of events. This article provides a place to start when considering D/R preparations for your organization.

Comfortable Circumstances

There’s a big difference between planned and unplanned D/R events. After traveling to an IBM* Business Continuity and Recovery Services (BCRS) center, I helped restore 20 AIX* machines within the allocated 72 hours. I was well-rested and well-fed. We knew the objectives ahead of time, and we took turns working and resting. Additionally, we didn’t restore all of the servers in the environment, but hand-picked a cross-section of them. We modified, reviewed and tested our recovery documentation before we made the trip, and we made sure there was enough boot media to do all the restores simultaneously – we even cut an extra set of backup tapes just in case.

We had a few minor glitches along the way, but we were satisfied that we could recover our environment. However, these results must be taken with a grain of salt, as this whole event was executed under ideal circumstances.

In another exercise, I didn’t have to travel anywhere; I went to the BCRS suite at my normal IBM site and spent the day doing a mock D/R exercise. We were done within 12 hours. We had a few minor problems, but the team agreed that we could recover the environment in the event of an actual disaster. Again, I was well-rested and well-fed.

Katrina Circumstances

As Hurricane Katrina was about to make landfall, e-mails went out asking for volunteers to help with customer-recovery efforts. I submitted my name, but there were plenty of volunteers, so I wasn’t needed. A few weeks later, the AIX admin who had been working on the recovery got sick, and I was asked to travel onsite to help.

Although I can’t compare the little bit that I did with the Herculean efforts that were made before I arrived, I was able to observe some things that might be useful during your planning.

A real D/R was much different from the tests that I’d been involved with in the past. The people worked around the clock in cramped quarters, getting very little sleep. There were too many people on the raised floor, and there weren’t enough LAN drops for the technicians to be on the network simultaneously.

The equipment this customer was using needed to be refreshed, so there was an equipment refresh along with the data recovery, which posed additional problems while rebuilding the environment. Fortunately, the customer had a hot backup site where the company could continue operations while this new environment was being built. However, as is often the case, the hot backup site had older, less powerful hardware. It was operational – but barely – and we wanted to get another primary site running quickly.

One of the obvious methods of disaster preparation is to have a backup site that you can use if your primary location goes down. Years ago, I worked for a company that had three sites taking inbound phone calls. They had identical copies of the database running simultaneously on three different machines, and they could switch over to the other sites as needed. During the time I was there, we had issues (snow, rain, power, hardware, etc.) that necessitated a switchover to a remote location. We needed to bring down two sites and temporarily run the whole operation on a single computer. This was quite a luxury, but the needs of the business demanded it. This might be something to consider as you assess your needs.

Leadership must be established before beginning – whether it’s a test or a real disaster. Who’s in charge: the IBM D/R coordinator, the customer or the technicians? And which technicians are driving the project: the administrators from the customer site, consultants or other technicians? All of these issues should be clearly defined so people can work on the task at hand and avoid any potential political issues.

The Importance of Backups

During my time with the Katrina customer recovery, I found out that one of the customer’s administrators had to be let go. On the surface, it looked like he’d been doing a great job with his backups: he ran incremental backups every night, and they ran quickly. However, nobody knew how many years ago he’d taken his last full backup, and the backup tapes were useless. Fortunately, their datacenter wasn’t flooded and, after the water receded, they were able to recover some of their hardware and data.

Are your backups running? Are you backing up the right data? Have you tested a restore? One of the lessons we learned during a recovery exercise was that our mksysb restore took much longer than our backup. Another lesson we learned was that sysback tapes may or may not boot on different hardware. Does your D/R site/backup site have identical hardware? Does your D/R contract guarantee what hardware will be available to you? Do you even have a D/R contract?
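On AIX, a regular bootable system backup is the first sanity check; a minimal sketch, assuming a tape drive at rmt0:

mksysb -i /dev/rmt0    # regenerate /image.data, then back up rootvg to bootable tape
# the only real proof is booting from that tape and restoring onto spare hardware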

Personnel Issues

We had personnel working on this project who were from the original customer location and knew how to rebuild the machines. However, they were somewhat distracted as they worried about housing and feeding their families and finding out what had happened to their property back home. Some were driving hundreds of miles to go home on the weekend – cleaning up what they could – and then making the long drive back to the recovery site. Can you give your employees the needed time away from the recovery so they can attend to their personal needs? What if your employees simply aren’t available to answer questions? Will you be able to recover?

Other Issues

Other issues that came up involved lodging, food and transportation. FEMA was booking hotel rooms for firefighters and other rescue workers, so finding places to stay was a challenge. For a time, people were working around the clock in rotating shifts. Coordinating hotel rooms and meals was a full-time job. Instead of wasting time looking for food, the support staff brought meals in and everyone came to the conference room to eat.

You may remember that Hurricane Rita was the next to arrive, so there were fresh worries about what this storm might do, and gasoline shortages started to occur. After you’ve survived the initial disaster, will you be able to continue with operations? I remember reading a blog around this time about some guys in a datacenter in New Orleans and all the things they did to keep their generators and machines operational. Do you have employees who are willing to make personal sacrifices to keep your business going? Will you have the supplies available to keep the people supporting the computers fed and rested?

Test, Test, Test

I highly recommend testing your D/R documentation. If it doesn’t exist, I’d start working on it. Are you prepared to continue functioning when the next disaster strikes? Will a backhoe knock out communications to your site and leave you without the ability to continue serving your customers? Do you have a BCRS contract in place? I know I don’t want to end up like the guy in the cartoon complaining that he did not have good backups and D/R procedures in place. Do you?

Network Troubleshooting

Edit: It has been a while since I needed to mess with SSA disks.

Originally posted September 2005 by IBM Systems Magazine

Recently, a user opened a problem ticket reporting that copying files back and forth from a server we support was taking an unusually long time. The files weren’t all that large, but the throughput was just terrible. After poking around a bit, we found that the Ethernet card wasn’t set to the correct speed: when we ran lsattr -El ent0, we saw that media_speed was set to Auto_Negotiation, and I knew immediately what the problem was.

We’ve seen the Auto_Negotiation setting on Ethernet adapters be problematic on AIX. Our Fast Ethernet port on the switch was always set to 100/Full. With Auto_Negotiation on, sometimes the card would correctly set itself to 100/Full, but at other times it would go to 100/Half. That duplex mismatch causes the slowdown, because you now have collisions on the network, which you can see with netstat -v:

Packets with Transmit collisions:
 1 collisions: 204076      6 collisions: 37         11 collisions: 1
 2 collisions: 65375       7 collisions: 6          12 collisions: 0
 3 collisions: 16894       8 collisions: 2          13 collisions: 0
 4 collisions: 2404        9 collisions: 0          14 collisions: 0
 5 collisions: 255        10 collisions: 2          15 collisions: 0

You can also determine if you’re having Receive Errors and see what speed your adapter is running at by using netstat -v. You’ll see something similar to the following:

RJ45 Port Link Status : up
Media Speed Selected: Auto negotiation
Media Speed Running: 100 Mbps Full Duplex

Transmit Statistics:                      Receive Statistics:
--------------------                      -------------------
Packets: 33608151                         Packets: 82280769
Bytes: 3364953629                         Bytes: 89992126877
Interrupts: 15105                         Interrupts: 79762362
Transmit Errors: 0                        Receive Errors: 14000
Packets Dropped: 1                        Packets Dropped: 14
                                          Bad Packets: 0

How did we fix the duplex issue? We detached the interface and ran a chdev to make it 100/Full: chdev -l ent0 -a media_speed=100_Full_Duplex. Once we made this change, there were no more collisions and the user was a happy camper.
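For the record, the sequence looked roughly like this (the interface and adapter names will vary, and the last step assumes the interface’s settings are already stored in the ODM):

ifconfig en0 down detach                        # quiesce the interface so the adapter attribute can be changed
chdev -l ent0 -a media_speed=100_Full_Duplex    # hard-code 100/Full to match the switch port
chdev -l en0 -a state=up                        # bring the interface back up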

Verifying Failed SSA Disks

Another issue that seems to crop up is when SSA disks die. How do you know which physical disk in your drawer needs to be replaced? In some instances, when the disk dies, you’re no longer able to go into Diag / Task Selection / SSA Service Aids / Link Verification to select your disk and identify it because it’s no longer responding. 

In this situation, you can use link verification to identify the SSA disks on either side of the failed disk. You can then look for the disk that’s between the two blinking disks, and you know which disk is bad. Another way to verify that you’ve selected the correct disk to replace is to run lsattr -El pdiskX, where “X” is replaced with your failing pdisk number. This provides the serial number that you can match with the serial number printed on the disk. (Note: The serial number may not be an exact match, but you can match fields 5-12 in the output – omit the trailing 00D – with the printed serial number on the disk.) Here’s the output:

lsattr -El pdisk45
adapter_a       ssa3             Adapter connection                                   False
adapter_b       none             Adapter connection                                   False
connwhere_shad  006094FE94A100D  SSA Connection Location                              False
enclosure       00000004AC14CB52 Identifier of enclosure containing the Physical Disk False
location                         Location Label                                       True
primary_adapter adapter_a        Primary adapter                                      True
size_in_mb      36400            Size in Megabytes

Another way to find your disk based on its location codes is by using lsdev -C | grep pdiskX. To replace the failed disk, run rmdev -dl pdiskX, physically swap in the replacement disk and run cfgmgr.
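Putting that together, the replacement might look like this (pdisk45 is just the example disk from the output above):

lsattr -El pdisk45     # note the connection location/serial to match against the physical disk
rmdev -dl pdisk45      # remove the device definition for the failed disk
# physically swap in the replacement drive, then rediscover it
cfgmgr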

If your SSA disk was part of a RAID array, hopefully at this point your hot spare took over, and you can just make your replacement disk the new hot spare. To do so, use Diag / Task Selection / SSA Service Aids / SMIT – SSA RAID Arrays / Change/Show Use of an SSA Physical Disk, and change your newly replaced disk from a system disk to a hot spare disk. To verify all is well, I like to go into smitty / Devices / SSA RAID Arrays / List Status of Hot Spare Protection for an SSA RAID Array. It should report that the RAID array is protected and the status is good. Keep in mind that only the latest SSA adapter (4-P) will allow List Status of Hot Spare Protection to work; older cards such as the 4-N don’t have this feature.

Exploring Linux Backup Utilities

Edit: I still really like Storix. Relax and Recover is pretty popular as well.

Originally posted April 2005 by IBM Systems Magazine

I’ve been an AIX administrator for a while now, and the mksysb and sysback utilities, which allow me to do bare-metal restores and return my machines to the state they were in when I last performed a backup, have spoiled me. As I’ve worked more with Linux machines, I’ve been bothered that they lack the equivalent utilities.

This is not to say that backup options don’t exist at all for Linux machines. Some use the dd command to copy the entire disk. Many have written scripts around the UNIX tape archive (tar) command, or use the open-source utility rsync, which duplicates data across directories, file systems or networked computers.
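As a simple illustration (the paths are made up), a cron-driven script along these lines covers a lot of ground:

# nightly tarball of configuration and home directories
tar -czf /backup/system-$(date +%Y%m%d).tar.gz /etc /home

# mirror the same data to another machine over the network
rsync -av --delete /home/ backupserver:/backups/home/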

Others use the Advanced Maryland Automatic Network Disk Archiver (AMANDA) or dump. According to the University of Maryland’s AMANDA Web site, AMANDA lets a LAN administrator “set up a single master backup server to back up multiple hosts to a single, large capacity tape drive.” AMANDA can use native dump facilities to do this.

Dump itself archives whole file systems and provides an interface for getting files back out of the archive. It lets you build in specifications for when to run the dump, but it isn’t for everyone because it might not support all the file systems you need.

Two more advanced archive tools are the cpio and afio utilities, both of which are considered to have better consistency and integrity than tar, with which they’re backward compatible. However, you may need to spend a lot of time with the man pages and other resources to use them effectively.

In my Google search, I also came across “Linux Complete Backup and Recovery HOWTO”, run by software engineer Charles Curley, a 25-year veteran of the computing industry. This site provides instructions for backup and restore methods for several Linux products.

The solutions I’ve talked about can be more attractive to Linux users because they’re free. Each of them also makes you build a minimal system before you can restore the rest of your system. When I argue AIX versus Linux, bare-metal restore is usually something I can bring up that Linux advocates can’t address. So I started wondering if there was a tool exactly like mksysb or sysback, where you could boot up and restore your machine in one step. The only tool I found–Storix–offers a free personal edition and a free demo edition that you can use for testing, but if you want the full benefits of this software, you’ll need to buy a license. It isn’t free software, and that may deter some in the Linux community.

Storix offers functionality similar to tools such as tar and secure shell (SSH) for backing up a machine over the network to a remote machine or tape drive, or locally to a file, a tape drive or a USB disk drive. The restore is where Storix differs, working as a true bare-metal restore. While other tools require you to reload the OS before running the restore process, Storix reloads the entire machine from the bootable CD, eliminating time spent configuring user IDs, groups, permissions, file systems, applications, etc. When you rely on people to go around rebuilding machines by hand after a hard drive dies, there are far too many places to inadvertently leave out something that had already been fixed. Bare-metal restores may help users feel more comfortable about making changes to the existing system because they can return the machine directly to its previous state.

There are as many methods to back up your machine as there are reasons to choose them. So go ahead and destroy your machine. Just make sure you have a good backup plan before you do so, no matter which tool you choose.

Preparing for Your Certification

Edit: Some links no longer work.

Originally posted December 2004 by IBM Systems Magazine

A co-worker recently finished the requirements for his IBM pSeries certification. I asked what he’d done to prepare for and pass the IBM eServer Certified Advanced Technical Expert: pSeries and AIX 5L (CATE). What follows are some ideas that came up while talking with him and another CATE-certified co-worker. In many cases, the primary attribute one needs to achieve this certification isn’t intelligence or skill with AIX, but the motivation to study and schedule the test.

Some people believe that taking tests and getting certifications are of no benefit and flatly refuse to do so. Others want to take every test available so they can prove to the world that they’re fully qualified for the tasks at hand. I’ve known plenty of people without certifications who were top-notch performers and really knew the material. I’ve also known people with certifications who, despite their book knowledge and test-taking abilities, lacked practical application skills. I believe a certification demonstrates that you’re familiar with the material and know enough about it to go pass a test. In some instances, employers and potential employers will examine your certifications during the hiring process. As is pointed out on IBM’s Certification Web site, certification is a way to “lay the groundwork for your personal journey to become a world-class resource to your customers, colleagues, and company.”

When preparing to take these tests, my friends told me that they would first visit the IBM Web site or directly access the tests, educational resources and sample tests. On IBM’s Certification Web site, the Test Information heading links to education resources, including Redbooks, which can be ordered or downloaded as PDF files. CATE certification requirements are outlined here.

To get started, determine if you meet the prerequisites, some of which are required and some of which are recommended. You can then choose three core requirement tests to take. You must take at least one of these tests–233, 234, 235, 236; you may substitute Test 187 for Test 237 and Test 195 for Test 197. Each test lists objectives, samples, recommended educational resources and assessment tools. After choosing your tests and acquiring your preparation materials, you’re ready to study. Some people take their time, reading Redbook chapters here and there as time permits. Others set study goals–read a chapter a day, read for 30 minutes a day or some other method. Still others find that going to a class works best for them. Use whatever method suits your learning style. At the end of each Redbook chapter is a short quiz to measure your understanding of the material presented in the chapter. These are excellent tools for verifying that you’re ready to take the test.

One co-worker would call the testing location and schedule a day and time for the test. He found that having that deadline looming kept him from procrastinating and forced him to work at acquiring the knowledge. My other co-worker preferred to methodically study the material, and wouldn’t schedule his test until he was sure that he had a good knowledge of the subject.

However you go about it, there’s a great satisfaction that comes from passing these certification tests. And while it doesn’t prove anything more than your familiarity with the subject and your ability to pass a test, it may make the difference between you and another candidate in your hunt for a promotion or a new position.

Software Provides ‘Remote’ Possibilities

Edit: I still use these tools. Although I cannot remember the last time I ran telnet.

Originally posted June 2004 by IBM Systems Magazine

Have you ever wanted to remotely control your Windows* machine from a machine running AIX* or Linux while you were working on the raised floor? Have you ever started a long-running job from your office and wanted to disconnect and reconnect from home or another location? Have you ever rebooted your machine after a visit from the “blue screen of death”? Did the reboot interrupt your Telnet or Secure Shell (SSH) session, requiring you to log back in and start over again? Have you ever wanted to share your desktop with another user for training or debugging purposes?

If you answered yes to any of these questions, then Virtual Network Computing (VNC) and screen are two useful tools worth investigating.

VNC Benefits

Developed by AT&T, VNC is currently supported by the original authors along with other groups that have modified the original code. RealVNC, TridiaVNC and TightVNC are all different versions that interoperate seamlessly.

VNC has cross-platform capabilities. For example, a desktop running on a Linux machine can be displayed on a Windows PC or a Solaris* machine. The Java* viewer allows desktops to be viewed with any Java-capable browser. With the Windows server, users can view the desktop of a remote Windows machine on any of these platforms using the same viewer.

This free tool is quick to download–the Java viewer is less than 100K. The AIX* toolbox for Linux applications also has a copy of VNC.

VNC is comparable to pcAnywhere or other widely used remote-control software. VNC’s power is in the number of OSs that it can allow to interoperate. AIX controlling Linux, Linux controlling Windows, Windows controlling them both–these are just some of the possibilities.

After loading VNC with smitty, you can start it by running vncserver on the command line. I recommend creating a separate user that hasn’t previously logged into an X session; I’ve seen strange behavior when using the same user ID to start a normal X and VNC session.

The first time you run vncserver, it prompts you for a password and asks you to verify it. This information is stored in ~/.vnc/passwd and can be changed with the vncpasswd command. (Note: This directory also contains the xstartup configuration file, along with some log files that show the times and IP addresses of the clients that have connected to vncserver.) Each time you run vncserver, you’ll get another virtual X desktop. (The first session runs on :1, the second on :2, etc.)

Verify that VNC is running with the ps -ef | grep vnc command (you should see Xvnc running). To connect from your client machine, run vncviewer. When prompted for the server name, enter either the IP address or the host name, followed by the session number you’re connecting to. VNC Web sites usually use snoopy:1 as a sample host name. You should then be prompted for the password you set up earlier.
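In practice the whole round trip is only a couple of commands; using the sample host name from above:

vncserver              # on the server; the first session comes up on display :1
vncviewer snoopy:1     # on the client; supply the password you set with vncpasswd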

At this point you should see an X desktop. The settings and desktop environment (CDE, KDE, etc.) can all be specified in ~/.vnc/xstartup. To allow others to view this session, select the shared-session option from the command line, or the shared-session setting from the GUI if your viewer is running in Windows. Once two or more users have connected to the vncserver, they share the same session, unless someone selects the “view only” option when connecting.

Much of this information applies when running vncserver on Windows. Once you download the installer, install the service and set up a password, you should be able to connect to your Windows machine by running vncviewer on your AIX machine. When you connect to your Windows machine, you don’t need to specify any display numbers after the host name, as there’s only one screen that you can connect to on a Windows machine.

This is a powerful tool when team members are working remotely and are having trouble explaining to you what they’re seeing. Once you fire up a VNC session, the problem is usually apparent. It’s also a great tool to install applications that require X when nobody feels like walking out to a raised floor or when the machine is running headless and nobody wants to hook up a monitor to it.

The real power comes when you close down vncviewer and then run it again from another location. You’ll be connected right where you left off–assuming the machine hasn’t been rebooted and no one’s stopped the vncserver process. When it’s time to stop the process, run vncserver -kill :X (where X is the session number you’re running).

The Power of Screen

Another useful tool included with the toolbox CD is screen, which allows a physical terminal to handle several processes, typically interactive shells. After loading screen with smitty, enter “screen” on the command line. You’ll see copyright information; hit “space” or “enter” to proceed. You’ll then see a typical command prompt.

To get started, vi a file, or read a manual page. When you need another command line, simply enter “ctl-a c” to create another session. You’ll be greeted with another prompt. You can continue to create several virtual sessions and toggle between them by entering “ctl-a space.” Entering “ctl-a 1” returns you to the first screen, “ctl-a 2” to the second, etc. To list all of your windows, enter “ctl-a w,” and to display all of the key bindings available in screen, enter “ctl-a ?.” If you need to detach from your session, enter “ctl-a d.” If you’re on another machine and want to attach to a screen running elsewhere, run screen -d to detach it there and screen -r to reattach it where you are (screen -d -r does both in one step).
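A typical session, just to make the flow concrete:

screen                 # start a new session and kick off a long-running job
# ... ctl-a d detaches and leaves everything running ...
screen -r              # later, from any login to that machine, reattach where you left off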

Power Combo

When I combine VNC and screen, I can use vncviewer to connect to my vncserver, which is running an xterm that’s running screen. My xterm takes up a small amount of real estate on the desktop, and I can quickly and efficiently move between my virtual command lines. This allows me to remain logged into multiple machines and easily switch back and forth. I also can cut and paste between OSs–cutting from AIX and pasting into an e-mail client running on Windows, or taking information from Windows and pasting it into my VNC session.

VNC and screen are a powerful combination. With these tools, you can drop what you’re doing (or get dropped, in the case of a network or OS outage), go to another location and pick up where you left off. It’s a handy way to work.

VMware Provides Virtual Infrastructure Solutions

Edit: Much has changed since this was written, but it is still a great tool for running multiple operating systems on the same machine.

Originally posted May 2004 by IBM Systems Magazine

So you’ve heard Linux is the wave of the future and you want to try it out, but you don’t have a spare machine to load it on? You find yourself on the road with one laptop but would like to be able to run more than one operating system without dual booting? You’re already running Linux as your desktop OS and have the occasional need to run Windows applications? You do development work and need test machines to crash and burn? VMware may be the right solution for you.

At vmware.com, you can download a demo version of this software and try it for 30 days, or purchase the software from the Web site. There’s good documentation on the site to give you a more thorough overview of what you can expect when you run the software, and it goes into detail about the VMware server offerings. I’ll be focusing on the workstation version of VMware.

Once you’ve loaded the software, configured the amount of memory and disk you plan to allocate to your virtual machine, and decided which type of networking support you want, you’re ready to power it on. You treat VMware like a regular PC, and you can create as many virtual machines as you like. This way you can run RedHat, SUSE, Debian, Gentoo, Mandrake and Windows 95, 98, or XP all on the same computer at the same time. (Although I wouldn’t recommend trying to run them all at once, unless you have a huge amount of RAM.) Put the boot CD for the OS into the CD-ROM drive, press the virtual power button and your virtual machine will load your OS for you.

Once you’ve loaded and patched your OS, you can load the VMware tools package. From the Web site, “With the VMware Tools SVGA driver installed, Workstation supports significantly faster graphics performance. The VMware Tools package provides support required for shared folders and for drag and drop operations. Other tools in the package support synchronization of time in the guest operating system with time on the host, automatic grabbing and releasing of the mouse cursor, copying and pasting between guest and host, and improved mouse performance in some guest operating systems.”

I find that VMware really shines when I put it into full screen mode, and forget that I’m even running a guest operating system. I use the machine as if it were running Linux natively, until I find I need to do something in Windows, at which time I “Ctrl-Alt” back into my Windows session. Then, when I’m finished, I either power down the virtual machine (shutdown -h now) or just hit the suspend button to hibernate my virtual session so that I can pick it up where I left off the next time I want to use that OS.

Another nice feature is the ability to take the disk files (Linux.vmdk) that represent my current hard drive configuration and copy them to CD or send them over the network to a coworker. That coworker can then boot my exact machine configuration to help me look at bugs, see how my desktop is set up or see exactly how my OS is configured.

If your organization finds that it doesn’t have the funds to allocate to a room full of test machines, or you need to take multiple machines with you for a presentation or demo, VMware is a solution to consider. Why settle for running Linux on that old machine in the corner when you can run it at the same time you run your primary workstation?