OS/2 eZine - http://www.os2ezine.com
16 December 2001
 
Isaac Leung got a degree in Engineering Physics and then Electrical Engineering, after which he promptly got a job as a product engineer at a company which makes high-speed datacom chips. He is old enough to have cut his computer teeth on Commodore 64s and first played with OS/2 1.3 EE while at a summer job with IBM. The first PC he ever owned came with Windows 95, but he soon slapped on OS/2 Warp 3 and has been Warping ever since. In his spare time, he plots to take over the world.

Why Won't IBM Fix My Bugs

One of the most common complaints that I see when perusing newsgroup postings is that IBM is refusing to fix, or even acknowledge, a certain bug, or that they are taking too long to release a fix. Users are understandably very frustrated when this occurs. What they don't see, of course, is what goes on behind the scenes at IBM.

I'll start off with a disclaimer. I do not work at IBM, nor do I speak for IBM! I did work for IBM one summer, but the closest I got to OS/2 was installing OS/2 1.3 EE on a PS/2 Model 50Z. I do work at a company which designs communications chips with very big customers. We make up part of the NASDAQ index, as well as the S&P 500, so there is a reputation to keep.

Some of our customers are the biggest network card, router and printer manufacturers in the world. The point being, we are accustomed to dealing with large customers and large vendors (including IBM). So, I can't tell you exactly what goes on behind the scenes at IBM, but I can give you some idea.

One thing to keep in mind is that while we sometimes see IBM employees in the newsgroups, it is not their responsibility or obligation to be present. Just because they happen to be there does not mean they speak in their capacity as IBM employees.

Bug? What bug?

First of all, if there's a problem, somebody has to be aware of it. This may sound obvious, but it's complicated by a small problem. Specifically, somebody who matters has to be aware of the problem and it has to be officially reported.

A lot of times, issues about OS/2 are brought up in the newsgroups. This is not an official channel to notify IBM of any problems. In fact, we should consider ourselves lucky that IBM's employees hang out on the newsgroups and listen to our gripes. As large as IBM is, they also have limited resources.

One example is the bug which caused OS/2 to hang if you repeatedly opened and closed Netscape 4.61. It has just been fixed in the latest Fixpack for the Convenience Package. It hadn't been fixed earlier because a few people occasionally talked about it on Usenet and that was it. Fortunately for us, with unofficial help from IBM employees and many other net denizens, the issue was eventually identified, routed to the proper channels and officially reported as a bug.

At the company I work for, once a customer of ours manages to get a bug reported, it gets assigned a number and filed in a database. This list of problems is reviewed regularly (e.g. every month) to make sure progress is being made and the issues are addressed. Judging by the "APAR" numbers that IBM assigns, it's not unreasonable to assume that they have a similar process.

Once is never enough...

Just reporting it isn't enough though. In order to merit more attention, this bug has to happen repeatably. If you can make it happen a lot of times, then it may be worth a second look. Preferably, somebody else (i.e. not you) should be able to replicate the problem. If our customer reports a bug once, but fails to report it again and nobody else can reproduce the problem, then we often mark off that item as complete. Maybe your OS/2 system failed to boot once. Is that an OS/2 bug? If millions of other users do not have this problem and IBM fails to reproduce it, then perhaps it was something specific to your setup at that time, like a power surge, for example. (Or, in our case, alpha particles from the lead used in solder striking one of the memory cells and flipping its contents.)

Take the example of that little Netscape bug. At first only a handful of people reported it. A lot of other people reported that it was fine for them. It turned out that they (and I was one of them) were wrong of course, but that wasn't really IBM's fault because...

Details please!

Even when a bug does get officially reported, and the customer can claim that it is repeatable, often that is still not enough for us to do anything. We need to know the exact setup and mode that the chip is running in. Or, they need to ship us their equipment so that we can use it. It's quite simple, really: if we can't break it, we can't fix it!

I'll stick with my previous example about Netscape (and keep in mind this is the story as seen through my eyes, not necessarily the truth, whatever that is). What happened with the Netscape bug was that people were not initially giving any details of their setup. I was one of the original people who reported that it was fine, no problem. Eventually, it turned out that was because I hadn't opened and closed it enough times. But initially, nobody mentioned how many times was enough, so a lot of people (IBM included) couldn't replicate the problem.

What needs to be reported are very specific details. For example,

"I'm running an IBM Thinkpad 770X with Warp 4, Fixpack 12 and no desktop or system enhancers of any sort. Clean install with no 3rd party applications installed. I open and close Netscape (June 2000 release) 16 times and on the 17th time, the system will hang. I've repeated it on the same software setup on an HP Vectra VLi8 desktop with the same results".

Now that's the sort of detail engineers can work with!

That's not the sort of detailed report that the average user can provide. I'm pretty sure people like my mom can't. So you can begin to understand why companies have a hard time dealing with individual home users. (Technically speaking, anyone could buy our communications chips. But anyone who knew about them, knew what to do with them and would actually fork out the money for one would be able to provide this sort of response. It's a nice way to avoid the problem, a luxury that IBM and Microsoft don't have, except by providing no support for individual users.)
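
The details in the sample report above can be thought of as fields in a structured record. Here is a minimal Python sketch of that idea. The class name, the field names and the completeness check are all invented for illustration; they are not IBM's actual reporting format.

```python
from dataclasses import dataclass, fields


@dataclass
class BugReport:
    # Hypothetical fields, modelled on the sample report in the text.
    hardware: str = ""
    os_version: str = ""
    fix_level: str = ""
    third_party_software: str = ""
    steps_to_reproduce: str = ""
    observed_result: str = ""
    also_reproduced_on: str = ""


def missing_details(report):
    """Return the names of the fields the reporter left blank."""
    return [f.name for f in fields(report) if not getattr(report, f.name).strip()]


vague = BugReport(observed_result="Netscape hangs my system")

detailed = BugReport(
    hardware="IBM Thinkpad 770X",
    os_version="Warp 4",
    fix_level="Fixpack 12",
    third_party_software="none (clean install, no desktop or system enhancers)",
    steps_to_reproduce="open and close Netscape (June 2000 release) 16 times, then open it again",
    observed_result="system hangs on the 17th open",
    also_reproduced_on="HP Vectra VLi8 desktop, same software setup",
)

print(missing_details(vague))     # lists everything except observed_result
print(missing_details(detailed))  # []
```

A report like `vague` leaves an engineer with six open questions before any work can start, which is roughly what happened in the early Netscape discussions.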

IBM has the additional problem of trying to reproduce the failure on the specific setup that it was reported on. Unless you bought a "standard" machine from one of the big vendors, the number of possible configurations is enormous! Try this example: 5 motherboard makers, 10 types of video cards, 3 sound cards, 3 hard drive manufacturers, 6 CD-ROM drives, 3 mice. That's already 5 x 10 x 3 x 3 x 6 x 3 = 8100 possible configurations. I'm understating the number of choices available, and I'm not counting older hardware that is no longer for sale but that a user might still be using. Sometimes, the bug might occur only with a particular piece of equipment in conjunction with a specific piece of software. Suppose IBM didn't have it. Suppose it's not for sale anymore. I'm sure you can begin to appreciate the difficulties now!
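
The arithmetic in that example is easy to check with a few lines of Python. The component counts come straight from the example; the dictionary name and its keys are only illustrative labels.

```python
from math import prod

# Component counts taken from the example in the text.
# The keys are illustrative labels, not a real hardware catalogue.
component_choices = {
    "motherboard": 5,
    "video card": 10,
    "sound card": 3,
    "hard drive": 3,
    "CD-ROM": 6,
    "mouse": 3,
}

# Each choice is independent of the others, so the counts multiply.
total_configurations = prod(component_choices.values())
print(total_configurations)  # 8100
```

Because the counts multiply, every extra category makes things much worse: adding a hypothetical seventh component with just 4 options would push the total to 32,400.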

Prioritizing

Okay, so now we know about a bug, we have all the details and we can reproduce it on demand. What's next? Well, when we review our list of bugs, it isn't on a first-come, first-served basis! The bugs are evaluated by how serious they are and by how many customers (or sometimes which customers) they affect.

A serious bug might be, for example, one that caused every 2nd packet to be corrupted in every operating mode. (Not that such a bug would ever make it into a production release.) A less serious one might be, for example, that every 15 million packets, the device resets itself, causing a 5 second loss of connectivity, but then continues operating normally. Certainly, if the bug affects you, you would be very annoyed! But if there is an easy workaround or it does not cause critical loss of data, then it is of a lower priority to fix. In fact, if we can determine an easy workaround without having to re-design the chip, we often simply issue a literature release informing our customers. Customers are usually okay with this and prefer this method, for a reason that I'll go into later. That reason is the magical production status.

Of course, money always talks too. Even a serious bug might receive lower priority if it only affects one small customer instead of many customers or one large customer. It all boils down to having limited resources. In an ideal world, all bugs would get fixed at the same time, as quickly as possible. But if you can only fix one at a time, then of course you must prioritize.

It's like being in the Emergency Ward at the hospital. The patient with a broken leg would really, really like to be tended to now, and to them it would be a very serious matter to have to wait. (And they might well be in a great deal of pain too.) But if there's only one doctor on duty, and a patient comes in with a punctured lung, there's no doubt that the one with the broken leg will just have to wait!
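
One way to picture this kind of triage is as a priority queue. The Python sketch below is only an illustration: the scoring rule (severity times number of customers affected, halved when an easy workaround exists), the field names and the sample bugs are all assumptions made up for the example. Nothing here describes how IBM, or any chip vendor, actually scores its bugs.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Bug:
    number: int
    summary: str
    severity: int            # assumed scale: 1 = cosmetic ... 5 = corrupts data
    customers_affected: int
    has_workaround: bool


def urgency(bug):
    """Hypothetical score: severity times reach, halved if a workaround exists."""
    score = bug.severity * bug.customers_affected
    return score / 2 if bug.has_workaround else score


def triage(bugs):
    """Return the bugs ordered from most to least urgent."""
    # heapq pops the smallest item first, so the score is negated.
    # The bug number breaks ties, so Bug objects are never compared directly.
    heap = [(-urgency(b), b.number, b) for b in bugs]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


bugs = [
    Bug(101, "device resets every 15 million packets", 2, 40, True),
    Bug(102, "every 2nd packet corrupted", 5, 40, False),
    Bug(103, "serious fault seen by one small customer", 5, 1, False),
]

for bug in triage(bugs):
    print(bug.number, bug.summary)
# Order printed: 102, then 101, then 103
```

Note that bug 103 is just as severe as bug 102 but ends up last, simply because it reaches only one small customer. With limited resources, that is the uncomfortable but predictable result.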

Is it fixed yet?

So, you've finally filed a bug report, got it acknowledged, we've repeated the problem in our labs. We've even given it high priority and we think we've got the fix. All done, right? Err... not quite...

Here's where that touchy little issue about "production" status comes in. It's something I personally never understood until I got out into industry. For an "organization" such as Linux, fixes come as soon as someone's coded it up and compiled it. It's available, if you want to try an 'untested' patch. (And in fact, that's what the stuff on IBM's Testcase is like.) Big business does not like this.

On our production machines, we run a particular flavour of UNIX. Two flavours in fact, big ones. But we don't run the latest versions, and not only because of cost, although that is part of it. We do not know if there will be new problems with a newer release of this UNIX; all we know is that this one works, so unless we are forced at gunpoint, we do not move. This is also the case with our customers. Even if we release a revision to a chip, we have to jump through a lot of hoops to convince them to use the new design, because they do not like changes either. Production status = stability, as in no changes.

IBM will have much the same problem with its customers. You can't just release a fix as soon as it seems to solve the problem. You can never be 100% sure you didn't introduce a new bug. (As someone once said, the only bug-free code is the code you didn't write.) Computer enthusiasts may be able to deal with that, but a big bank like HSBC or the Bank of Brazil is not going to tolerate a new bug on thousands of its terminals. I'd be willing to bet that if the bug that IBM identified doesn't affect them, they won't upgrade.

[As an aside here, I once helped a fellow who ran into me on the Internet. His problem was that his company's product used OS/2 Warp 3. At that time IBM was ready to phase out Warp 3 and stop selling it. He was in quite a bind, since his company was exceedingly reluctant to use Warp 4, because Warp 3 worked just fine. Upgrading to Warp 4 would entail all sorts of extra cost and effort, such as...]

Documentation

Engineers (at least the good ones) actually spend most of their time... writing. If you want to be great, not just good, you'll have to be able to write well. This isn't the Dark Ages, where mad alchemists kept secrets to themselves! If you do something, you need to document it so others can support your work, fix bugs or reproduce what you've created.

The result is that any time we revise our chips, we have to issue new, updated documentation. One of those documents is a good description of the device and all its functionality. And any complex chip nowadays is going to have a specification document anywhere from several hundred to several thousand pages long. Somebody is going to have to update it. Once it's done, it's got to be circulated for review and approved before it can be released. It's probably going to be the same thing for OS/2, which isn't some tiny piece of software with just 10 functions to document.

Oh, but it gets worse, because ... did you notice I said "..one of those documents..."? Yes, there's more. That's just the document that most customers or developers usually deal with. But not only do you have to document the new functions or specifications, you also have to prove you fixed the problem, which requires...

Regression Testing

In this business, or at just about any decent software firm, there's this tiny issue called "regression testing". The purpose of this test is to ensure that you have not broken any features with your fix. It's really simple to say, not so easy to do.

OS/2 Warp 4, in its original public release, probably had a set of tests it had to go through to ensure that it worked as designed. When Fixpack 1 came along, not only did IBM have to test that the bugs were fixed, but it probably also had to rerun at least some of the original tests to make sure that everything still worked as designed. At Fixpack 2, new bugs were fixed, but then they had to test to ensure that the Fixpack 1 bugs were still fixed, and that the original functionality was still preserved.

Move along to Fixpack 15. Take a look at IBM's APAR list that came with that. Think of all the testing that needs to be done. Obviously, IBM needs to test the Fixpack 15 list of "fixed bugs" to ensure that they are indeed fixed. Then they have to go back through FP14, FP13, .... all the way down the APAR list to the FP1 bugs to make sure all those bugs are still fixed.

As you can imagine, the more time passes, the bigger the list of bugs and the longer it would take to test each release. Now I'm sure IBM has some procedure for speeding this up, such as skipping some tests (that's probably how the FTP Host Template bug got slipped in at FP12 or so), but the amount of testing still remains daunting. It takes even longer if you have to test it on several hardware configurations too.
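
As a rough illustration of how this workload grows, suppose every Fixpack fixes the same number of bugs and every earlier fix is re-checked at each release. The figure of 50 fixes per Fixpack in this Python sketch is invented; the real APAR counts are not given here, and the sketch ignores the original functionality tests entirely.

```python
def checks_for_release(fixpack, fixes_per_fixpack=50):
    """Checks needed to release one Fixpack: its own fixes plus every
    fix from all earlier Fixpacks (assumed constant count per Fixpack)."""
    return fixpack * fixes_per_fixpack


def cumulative_checks(last_fixpack, fixes_per_fixpack=50):
    """Total checks run across all releases from Fixpack 1 to last_fixpack."""
    return sum(checks_for_release(n, fixes_per_fixpack)
               for n in range(1, last_fixpack + 1))


print(checks_for_release(1))    # 50
print(checks_for_release(15))   # 750
print(cumulative_checks(15))    # 6000
```

Under these assumptions, the work for a single release grows in step with the Fixpack number, and the total effort spent across all releases grows roughly with its square. That is why skipping or automating some of the older tests becomes so tempting.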

Is it finally done?

Well, I sure hope so. I don't know about you, but I've certainly had enough! By this time, the bug has been fixed, all the documentation has been updated and the new version has been fully tested and is ready to go. Only now will it be ready to release onto the public's eager hard drives. (And that's after you properly prepare the "distribution channels", or in IBM's case, the FTP sites and associated files and the RSU installation files. That's no small task in itself: you have to update all the sites that point to the Fixpack locations and make sure they're all correct.)

I get tired just writing about this; I wouldn't want to be the one to actually release a Fixpack! I hope some IBM'er will correct me if there are any mistakes (or provide extra detail). At the least, I hope I've given you some appreciation of what it takes to get a bug fixed in a product when you are dealing with a large corporation whose big business customers have products and systems in production, making big money for them. And it's for you too. How would you like it if, for example, 3Com changed the chip in their 3C905 card without telling you, and without testing it to make sure that everything still worked the same as before? Not happy, I bet! That's why the big companies (with lots of customers) will ask for so much before a change is allowed.

Next time a Fixpack is released, we should all cut the IBM OS/2'ers a bit of slack. I'm sure they worked hard to release as many fixes as they could in a timely manner.


Copyright (C) 2001. All Rights Reserved.