Storage Update (and woes)

As I mentioned in my previous post, I’ve had my Norco case sitting around gathering dust for quite a few months.

Well, all the bits and pieces finally arrived and I had a chance to build it sometime around mid-September. Boy, she’s a beaut! The philosophy behind the build was to only build it once - I’m not going to build another storage chassis for the next 5+ years, just you see!

This baby is packed with:

  • 1x Intel Core i7 930 2.80GHz Quad-Core CPU
  • 1x Corsair HX3X12G1333C9 (12GB DDR3 Kit)
  • 1x Zotac NVIDIA GT210
  • 1x Corsair TX950 Power Supply
  • 6x Western Digital WD2001FASS Hard Drives
  • 1x Seagate Momentus XT 320GB Hard Drive
  • 2x AOC-USAS-L8i Controllers

For the most part, the build went quite smoothly - but I’ve had a few follow-up problems which have caused me a large amount of grief.

The first big problem I had was that drives were not coming up on the controller very reliably. A friend with the same chassis had warned me that there were potential problems with the SFF-8087 jacks, but that it was easily solved.

After about a week of troubleshooting, it turned out the cables I have had two major problems. The first is the problem above - the chassis support sits so low that it holds the clip down (in the unlock position), so if you move the case, the cable can rattle loose. This is easily solved by removing the chassis supports, detaching the backplanes and using the trusty Dremel to create clearance.

Problem the second: it turns out the cables I have also have extra moulding, which pushes against the chassis supports so the contacts on the cable don’t seat properly. Instead of taking the Dremel to the chassis again (I was quite sick of pulling apart the bays at this point), I took a scalpel to the moulding and cut away the bits that weren’t needed, as they were cosmetic anyway. Figuring this out took another week (hooray).

Hooray, the box lives! After initialising the six drives in a RAID 5 array and laying down a nicely aligned ext4 partition, I decided to crank up a benchmark!
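For the curious, "nicely aligned" on a striped array usually comes down to two numbers: the stride (filesystem blocks per RAID chunk) and the stripe width (filesystem blocks per full data stripe). Here is a minimal sketch of that arithmetic. The 64 KiB chunk size and 4 KiB block size are assumptions for illustration, not necessarily what this array uses, and the function name is made up.

```python
def ext4_raid_alignment(num_drives, chunk_kib, block_kib=4, parity_drives=1):
    """Return (stride, stripe_width) in filesystem blocks.

    stride       = RAID chunk size / filesystem block size
    stripe_width = stride * number of data-bearing drives
    """
    if chunk_kib % block_kib != 0:
        raise ValueError("chunk size must be a multiple of the block size")
    data_drives = num_drives - parity_drives
    if data_drives < 1:
        raise ValueError("need at least one data drive")
    stride = chunk_kib // block_kib
    stripe_width = stride * data_drives
    return stride, stripe_width


# Six drives in RAID 5 (one drive's worth of parity).
# ASSUMPTION: 64 KiB chunks and 4 KiB filesystem blocks.
stride, stripe_width = ext4_raid_alignment(num_drives=6, chunk_kib=64)
print(f"stride={stride} stripe_width={stripe_width}")
```

These two values are what mke2fs's stride and stripe_width extended options expect, so if the real chunk size differs, plug it in and the numbers change accordingly.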

Running Bonnie++ against the virgin filesystem yielded 582 MiB/s sequential reads and 425 MiB/s sequential writes. Boo-fucking-yeah.

That done and dusted, given it’s my first time doing a software RAID with all my worldly data on it, it’s time for testing RAID failure and recovery - and it’s a good thing I did.

It turns out this box has another surprise in store for me. Adding a drive to the chassis causes existing drives to drop out of the software RAID. Whoops, not good. Actually - given that more than two drives fall out of my RAID 5 (which can only survive losing one), that’s very bad. As in, showstopper.
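To show why more than one missing drive is fatal for RAID 5, here is a toy sketch of the XOR parity idea. It is a simplification: a real array rotates parity across the drives and works on much larger chunks, and the block contents here are made-up values. The principle is the same though - one missing block can be rebuilt from the rest, two cannot.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Five data blocks plus one parity block, like one stripe on a six-drive array.
# The byte values are arbitrary illustration data.
data = [bytes([value] * 4) for value in (0x11, 0x22, 0x33, 0x44, 0x55)]
parity = xor_blocks(data)

# Lose one drive (index 2): XOR of everything that survived gives it back.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data[2])

# Lose two drives (indexes 2 and 3): all that can be recovered is the XOR
# of the two missing blocks, which is not enough to separate them again.
leftover = xor_blocks(data[:2] + data[4:] + [parity])
print(leftover == xor_blocks([data[2], data[3]]))
```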

It’s been about a month since I hit this problem - and with taking on the Treasurer role at WAIA and a hectic month at work - I’ve had stuff all time to look into this properly.

For a good while I’ve been troubleshooting on the premise that the Linux mptsas driver for this LSI chipset SAS controller card is buggy (and it is). I’ve since switched to running a 2.6.32 kernel (an Ubuntu backport with all the Canonical patches is available from the Kernel PPA). The 2.6.32 kernel contains a much newer mptsas driver which resolves a lot of bugs. I’m not sure it’s the latest 4.18 version of the driver, which also addresses a very dramatic smartctl issue (cheers to MetalPhreak on IRC for the heads up), but it seemed like the path to go.

When disconnecting the drives, dmesg would pop up saying all the drives had failed and then show them again under new device nodes. I had primarily been testing with a single row of drives, and was assuming the driver was changing drives (for some reason) because a disk had come up on a lower ID.
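One way to keep track of drives hopping between device nodes is to snapshot the persistent names under /dev/disk/by-id before and after an event, then compare. This is a rough sketch and not something I ran during the troubleshooting above. It assumes a Linux box with udev populating that directory, and the function names and the disk IDs in the example are hypothetical.

```python
import os
import re

BY_ID_DIR = "/dev/disk/by-id"  # ASSUMPTION: udev-managed persistent names exist here


def snapshot_device_nodes(by_id_dir=BY_ID_DIR):
    """Map each persistent disk ID to the kernel node it currently points at."""
    mapping = {}
    for name in os.listdir(by_id_dir):
        if re.search(r"-part\d+$", name):
            continue  # skip partition entries, keep whole disks only
        mapping[name] = os.path.realpath(os.path.join(by_id_dir, name))
    return mapping


def diff_snapshots(before, after):
    """Return (moved, missing, added) between two snapshots."""
    moved = {
        disk_id: (before[disk_id], after[disk_id])
        for disk_id in before.keys() & after.keys()
        if before[disk_id] != after[disk_id]
    }
    missing = sorted(before.keys() - after.keys())
    added = sorted(after.keys() - before.keys())
    return moved, missing, added


# Hypothetical snapshots: disk-a re-appears under a new node after disk-c is added.
before = {"disk-a": "/dev/sdb", "disk-b": "/dev/sdc"}
after = {"disk-a": "/dev/sdh", "disk-b": "/dev/sdc", "disk-c": "/dev/sdb"}
moved, missing, added = diff_snapshots(before, after)
print("moved:", moved)
print("missing:", missing)
print("added:", added)
```

On a live system you would call snapshot_device_nodes() before and after plugging a drive in, and feed the two results to diff_snapshots() instead of the hand-written dictionaries.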

It’s only this evening, now that I’ve been alert enough to approach the problem sensibly, that I’ve noticed what the problem is. Firstly, the drives on the backplane are on HIGHER Port IDs, and secondly (this is important folks) - all the drive activity lights flick on when a new drive is inserted.

This is a completely different scenario, in the nice sensible land of hardware. I like it here, there aren’t arrogant software maintainers who treat you like a numpty - but that’s a story for another time.

So either the controller is resetting all the drives on those ports (unlikely), or the backplane card is causing a power interruption or SATA connection drop when the other drive gets connected. Fortunately, I have an SFF-8087 to 4x SATA breakout cable, so it’s easy to test on the bench.

Testing with the backplane removed and hoo-fucking-rah, none of the other drives reset. So, it’s an issue with the backplane card. Not this one specifically, all of them.

So for now, I’ve shot the folks over at Norco an email asking for help, and hope it’s something simple like a known problem (replacement cards), or a jumper setting.

For now I’m not using this box, as I can’t trust the backplane cards - which is a severe disappointment, but hopefully Norco look after me and can fix the problem so the chassis is hot-swappable as advertised. I’m going to be very upset if they can’t (or won’t) fix it.

But for now, after this essay of a post - I’m off to bed where I’m a viking!