After some apparent RAID hardware issues earlier in the year we decided to perform some updates and reconfiguration on the ROSA DAS systems to increase resilience.
The 'old' configuration on the DAS systems had a RAID-1 array for the OS (two disks mirroring each other exactly) and a RAID-0 array for observing data (8 disks with data striped across them for speed). The latter setup is needed for write performance but has no redundancy, so if a single drive fails then the entire RAID is lost. It transpired that various power glitches had likely caused the RAID-0 arrays on DAS3 and DAS4 to fall out of sync, with one or more drives being regarded as foreign by the RAID BIOS. Unfortunately the BIOS would warn about the foreign drives at boot time but continue to boot regardless, and the OS would then hang when trying to mount the drive, because large parts of the partitions were missing. The situation was remedied by using the RAID BIOS to erase the 'foreign' drive setup and reconstruct the RAID-0, though this came at the cost of losing all data on the RAID-0 partition.
To guard against this problem recurring I reconfigured the data arrays from RAID-0 to RAID-10. In this configuration the data is striped across mirrored pairs of hard drives, giving both redundancy and speed, though at the cost of capacity, as half of the hard drives are now devoted to a backup role. The capacity issue was mitigated by replacing the 147GB HDDs in DAS3-6 with 600GB units, resulting in a net increase in available storage; DAS1/2 already used 600GB HDDs so have had their net capacity reduced from around 4TB to 2TB, but are of course now much more robust.
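The capacity trade-off described above can be checked with a little arithmetic. The following is only a rough sketch: the function name is made up for illustration, and the figures are nominal (decimal GB, before any formatting or filesystem overhead), so they come out somewhat higher than the formatted capacities quoted in this note.

```python
def nominal_capacity_gb(n_drives, drive_gb, level):
    """Nominal usable capacity of an array, ignoring formatting overhead.

    Hypothetical helper for illustration only; not part of any DAS tooling.
    """
    if level == "raid0":
        # Pure striping: every drive contributes its full capacity.
        return n_drives * drive_gb
    if level == "raid10":
        # Striping across mirrored pairs: half the drives hold copies.
        if n_drives % 2 != 0:
            raise ValueError("RAID-10 needs an even number of drives")
        return (n_drives // 2) * drive_gb
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    # DAS3-6 before and after: 8 x 147GB RAID-0 versus 8 x 600GB RAID-10.
    print("old RAID-0, 8 x 147GB :", nominal_capacity_gb(8, 147, "raid0"), "GB")
    print("new RAID-10, 8 x 600GB:", nominal_capacity_gb(8, 600, "raid10"), "GB")
    # DAS1/2 before and after: same 600GB drives, so capacity halves.
    print("old RAID-0, 8 x 600GB :", nominal_capacity_gb(8, 600, "raid0"), "GB")
```

On these nominal figures the machines that moved from 147GB to 600GB drives roughly double their data space despite the mirroring, while the machines that kept their 600GB drives lose half of it.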
Each DAS machine still has a RAID-1 for the OS, and now has a RAID-10 for the data partition. On most machines the RAID-10 is composed of eight 600GB HDDs for a total formatted capacity of 2TB; DAS3 has a 6-drive RAID-10 due to an apparent hardware issue with one of the drive bays (see below) and so only has 1.5TB data space. Due to the transient hardware issue on DAS3 the redundancy of the RAID-10 was tested in actual operation, and it seemed to work as expected!
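To illustrate what the RAID-10 redundancy does and does not protect against, here is a small sketch that enumerates drive failures. It assumes the drives are mirrored in adjacent pairs (0,1), (2,3) and so on; the actual pairing chosen by the RAID controller is not known here, and the function names are hypothetical.

```python
from itertools import combinations


def raid10_survives(failed, n_drives):
    """True if no mirrored pair has lost both of its members.

    Assumes adjacent pairing (0,1), (2,3), ...; this is an illustrative
    assumption, not the controller's documented layout.
    """
    failed = set(failed)
    return all(not {first, first + 1} <= failed
               for first in range(0, n_drives, 2))


def raid0_survives(failed):
    """A RAID-0 array is lost as soon as any single drive fails."""
    return len(set(failed)) == 0


def raid10_survival_fraction(n_drives, n_failed):
    """Fraction of all n_failed-drive combinations that a RAID-10 survives."""
    combos = list(combinations(range(n_drives), n_failed))
    survived = sum(raid10_survives(c, n_drives) for c in combos)
    return survived / len(combos)


if __name__ == "__main__":
    for n_drives in (8, 6):
        for n_failed in (1, 2):
            frac = raid10_survival_fraction(n_drives, n_failed)
            print(f"{n_drives}-drive RAID-10, {n_failed} failed: "
                  f"{frac:.0%} of cases survive")
```

Under that assumption any single drive failure is survivable, and a second failure is only fatal if it hits the mirror partner of the first, whereas a RAID-0 is lost on the first failure.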
In addition to the hardware changes I have configured the OS on the machines to mount the data partition after the main boot process, so the system should not simply hang at a blank screen but boot to a usable state even if there is a problem.
To avoid problems with future runs I make the following suggestions:
check_das_drives should be run daily from the gateway root account to ensure that all arrays report being in the optimal state. See below for details on this script.
To be added
In case of problems, contact me immediately. As I don't normally check my QUB email off-site, use my personal address (robertryans @ me.com) or phone my mobile on +447837835852. Mail to that account will show up on my phone immediately, and my iPhone is rarely more than a metre from me.
Warning - use of these details for anything other than a ROSA emergency will result in severe offence to the caller.