Sunday, March 20, 2005

Penguins and Linux: A love-hate relationship

If you are wondering where I have been for the past two days, it is simple: I have been buried deep in the guts of my computer, which locked up from time to time and then, when reset, came back up rebuilding one of my RAID arrays.

First, a quick run-down of my computer's configuration. This is not an el-cheapo setup by any means. I make my living with this thing, and I long ago learned that skimping on the tools of your trade is no virtue; quality pays. This thing has a top-of-the-line Antec case and power supply. A Promise 4-port SATA controller. A top-of-the-line (at the time) Soyo motherboard with a one-notch-below-top-of-the-line (at the time) VIA chipset that has two SATA ports onboard, as well as normal IDE, Ethernet, Firewire, and USB 2.0. A middle-of-the-road NVIDIA video card (ultimate video performance is not required for my purposes, whereas absolute stability is). A 2.1GHz AMD Athlon XP processor (one notch below top of the line at the time) and 512MB of RAM. And, most importantly, three 160GB Maxtor 7200RPM SATA hard drives, arranged as one RAID1 (mirroring) array across the bottom for my boot partition, and five Linux RAID5 arrays splitting up the rest of the drives for my various partitions. All of this is running Red Hat Fedora Core 3 Linux. To round things out, I have a Firewire DVD burner and a 250GB Firewire external hard drive that get shared with my laptop for backups, and of course a couple of printers (a laser printer for invoices and other business printing, an Epson inkjet for my personal printing).

Okay, my first notion was that, because I heard a loud click from the hard drive area when it locked up, a hard drive was locking up. The problem is this: Why in the world should a hard drive locking up take out the whole freakin' computer?! I mean, the whole point of RAID is that if a hard drive goes AWOL, only that one hard drive goes out, not the whole friggin' system! So I did a read test of all the drives. Unfortunately, they all passed the read test (i.e., I could read every byte on every drive). It was clear that a read test wasn't going to diagnose the problem.
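For the curious, a read test of this kind boils down to reading a device from the first byte to the last and seeing whether anything errors out. Here is a minimal sketch in Python. It is an illustration, not the tool I actually ran, and the function name `read_test` and the 1MB chunk size are just assumptions for the example.

```python
def read_test(path, chunk_size=1024 * 1024):
    """Sequentially read every byte of `path`.

    Returns (bytes_read, error), where error is None if the whole
    device or file could be read, or the OSError that stopped the read.
    `read_test` is a hypothetical helper written for this post.
    """
    bytes_read = 0
    try:
        # buffering=0 opens the raw file so each read goes to the OS directly
        with open(path, "rb", buffering=0) as dev:
            while True:
                chunk = dev.read(chunk_size)
                if not chunk:  # empty bytes means end of device
                    break
                bytes_read += len(chunk)
    except OSError as exc:
        # An I/O error (or a missing device) ends the test early
        return bytes_read, exc
    return bytes_read, None
```

You would call it as something like `read_test("/dev/sda")` as root, where the device name is only an example. Note what it does and does not prove: a clean pass says every sector is readable right now, and says nothing about a drive that hangs under heavy load, which is exactly why it failed to catch my problem.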

At the time I had the on-board SATA disabled, and was using only the four-port Promise SATA. So I decided, hmm, maybe Promise's driver for Linux sucks. So I enabled the onboard SATA and moved two drives to it, and rolled a new initrd with the driver for the onboard SATA and rebooted into the new configuration.

Well, it locked up again when I tried to back up everything to the Firewire hard drive. So I said to myself, "Hmm, maybe it's the one that's still on the Promise controller." So I swapped cables.

And sure enough, it locked up again the next time 'round... but *THIS* time, using the VIA SATA driver, it printed out which drive was gumming the works. So I unplugged that drive, rebooted (into degraded mode), backed up my stuff to the Firewire hard drive with no problem (which 100% verified that the drive in question was the evil one gumming up the works), and then I had to repair the thing.

So a quick trip to Fry's Electronics, grab a new drive, install it, boot into rescue, copy the partition table from one of the other drives, manually assemble the RAID arrays in two-drive (degraded) mode, hot-add the new partitions to the RAID arrays, wait for the RAID arrays to rebuild (25 minutes -- this is a FAST computer), and... It works!
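The repair steps above can be sketched as a short Python helper that only builds the command lines and never runs them. This assumes the `sfdisk` and `mdadm` tools; I have not said above which tools I used, so treat the commands as one plausible way to do it. The function name `rebuild_plan`, the device names, and the array-to-partition mapping are all hypothetical.

```python
def rebuild_plan(surviving_disks, new_disk, arrays):
    """Build (but do not execute) shell commands for replacing a failed RAID member.

    surviving_disks: the disks that still work, e.g. ["/dev/sda", "/dev/sdb"]
    new_disk:        the blank replacement, e.g. "/dev/sdc"
    arrays:          maps each md device to the partition number its members
                     use, e.g. {"/dev/md0": 1}
    All names here are illustrative assumptions.
    """
    plan = []
    # 1. Copy the partition table from a surviving disk onto the new one.
    plan.append(f"sfdisk -d {surviving_disks[0]} | sfdisk {new_disk}")
    for md_dev, part_no in arrays.items():
        members = " ".join(f"{disk}{part_no}" for disk in surviving_disks)
        # 2. Assemble the array degraded; --run starts it with a member missing.
        plan.append(f"mdadm --assemble --run {md_dev} {members}")
        # 3. Hot-add the new partition, which kicks off the rebuild.
        plan.append(f"mdadm {md_dev} --add {new_disk}{part_no}")
    # 4. Rebuild progress shows up in /proc/mdstat.
    plan.append("cat /proc/mdstat")
    return plan
```

Printing the result of `rebuild_plan(["/dev/sda", "/dev/sdb"], "/dev/sdc", {"/dev/md0": 1})` gives you a checklist to read over before typing anything as root. Building strings instead of executing them is deliberate: a typo in a device name here wipes a partition table.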

Lessons learned:

  1. Linux sucks.
  2. Linux, out of the box, on commodity hardware, is not ready for mission-critical purposes, because it locks up in situations where a lock-up is absolutely unacceptable. (Note: There are Linux systems that do not share this flaw, but they are running special hardware, not commodity hardware.)
  3. That said, I would not have fared any better under Windows. Indeed, with all the hardware-swapping involved, I probably would have ended up having to re-install Windows (I have had very poor luck getting Windows to deal with massive hardware changes).
  4. Penguins still love Linux, but hate it too.
And now that this real-world exercise in technological sado-masochism is over, maybe I can get some work -- and blogging -- done :-}.

- Badtux the Linux Penguin
