May 2000
Years ago, when people spoke about RAID (Redundant Array of Inexpensive Disks) systems, they were often speaking about large servers with many disks, and storage space that was quite out of the price range of the average person. Since having additional disks increased the odds that one of those disks would fail, redundancy was required for these critical systems to ensure that the entire system did not go down if a single disk failed. RAID storage is still quite popular today, but is no longer limited to enterprise servers in big corporations. With software RAID available in FreeBSD, NetBSD, and OpenBSD, anyone with more than one disk can run some form of software RAID.
This article will discuss the merits and features of RAIDframe. RAIDframe is both a framework for rapid prototyping of RAID systems and the real-life software RAID implementation used in NetBSD and OpenBSD. This article will refer to RAIDframe as it is found in NetBSD, where its current development takes place.
RAIDframe was developed by the Parallel Data Laboratory at Carnegie Mellon University (CMU). The purpose of RAIDframe was to provide an environment where RAID experiments could easily be performed, and where new RAID algorithms could easily be implemented and tested. As distributed by CMU, RAIDframe consisted of a RAID simulator, a user-land disk driver, and a kernel-level device driver for (then) Digital Unix. RAIDframe, as found in NetBSD, is a fully-integrated kernel-level device driver. While this driver supports many new features (such as hot spares, component labels, and root on RAID) the core algorithms in RAIDframe have remained almost entirely unchanged. The years of testing in the simulation and experimental environments have provided an extremely solid RAID foundation on which additional functionality can be built.
A RAID set is made up of a number of 'components.' While some implementations may require a component to be an entire disk, in RAIDframe these components are simply partitions on disks or partitions on other RAID sets. There are no restrictions on the types of disks which can be used. While SCSI and IDE may be the most popular on desktop machines, other types (like HP-IB) have also been used.
The RAIDframe driver is a pseudo-device which behaves as a normal block and character device. Upper-level drivers talk to it through the regular open(), close(), and strategy() routines, while it communicates with the underlying devices (be they other RAID devices, SCSI disks, or virtual devices) via their corresponding IO strategy() routines. These clean and well-defined interfaces make it easy both to interface with the RAID driver and to maintain it. When RAIDframe was in initial testing on NetBSD, the ``disks'' used were VNDs (Vnode Disks) using files imported via NFS. After it became clear that things were working fairly well, IDE disks were used. Since then, almost all of the testing has been done using SCSI drives.
The devices available to the user are of the form /dev/{r,}raid[0-9][a-h]. These ``disks'' can be partitioned and formatted (via ``newfs'' for FFS, or ``newfs_lfs'' for LFS) as though they were regular disks. Swapping to partitions on RAID sets is also supported.
RAIDframe supports the ``traditional'' RAID levels 0, 1, 4, and 5, and includes partial support for other RAID levels such as RAID level 6 (P+Q) and parity logging.
Concatenated components are not supported under RAIDframe. The ccd(4) device works fine if that functionality is required. N-way mirroring is also not currently supported.
A key to a number of the new features of RAIDframe is the use of ``component labels.'' Each component in a RAID set has a component label. This label contains information such as:
the number of rows and columns in the RAID set.
the position of this component in the RAID set.
a serial number for the RAID set.
a ``modification counter''
the component status
other configuration parameters needed to automatically configure the RAID set.
The information stored in the component labels completely describes a RAID set to the point where only the information in the component labels is needed to automatically configure the RAID set. Since information in the component labels is independent of the physical ``disk'' on which it resides, SCSI IDs (for example) can be switched around, and auto-configurable RAID sets will still configure automatically and correctly.
If auto-detection of RAID components and auto-configuration of RAID sets is enabled, the kernel will examine each disk partition of type ``RAID'' to see if it has a valid component label. The component matching algorithms group together related components, and if enough matching components are available the corresponding RAID set is configured. Because this auto-configuration occurs before the root partition is mounted, a partition on a RAID set can be used for the root filesystem, further increasing the robustness of the system. (The author's main machine has the root filesystem on a RAID 1 set.)
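The grouping step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not RAIDframe's kernel code: the `Label` fields, the rule that only components carrying the newest modification counter count as current, and the one-failure tolerance for redundant levels are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Label:
    """Hypothetical, simplified component label (field names are assumptions)."""
    device: str        # where the component was found; NOT used for ordering
    serial: int        # serial number of the RAID set
    mod_counter: int   # modification counter
    column: int        # position of this component in the set
    num_columns: int   # how many components the set has
    raid_level: int

def tolerated_failures(level):
    # Assumption for this sketch: RAID 0 tolerates no missing component,
    # RAID 1/4/5 tolerate one.
    return 0 if level == 0 else 1

def configurable_sets(labels):
    """Group labels by serial number and return the sets that can be configured."""
    by_serial = defaultdict(list)
    for lab in labels:
        by_serial[lab.serial].append(lab)

    result = {}
    for serial, group in by_serial.items():
        newest = max(l.mod_counter for l in group)
        # Stale components (older counter) are treated as missing.
        current = {l.column: l.device for l in group if l.mod_counter == newest}
        ref = group[0]
        missing = ref.num_columns - len(current)
        if missing <= tolerated_failures(ref.raid_level):
            # Components are ordered by the column in the label, so the
            # device name (and hence the SCSI ID) does not matter.
            result[serial] = [current.get(c) for c in range(ref.num_columns)]
    return result

labels = [
    Label("sd2e", 1234, 7, 1, 3, 5),
    Label("sd0e", 1234, 7, 0, 3, 5),
    Label("sd5e", 1234, 7, 2, 3, 5),
    Label("sd1e", 99, 3, 0, 2, 0),   # RAID 0 set with one component missing
]
print(configurable_sets(labels))
```

In this sketch the RAID 5 set is assembled in label order regardless of which device names the disks show up under, while the incomplete RAID 0 set is left unconfigured.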
Since misconfiguration of a previously configured RAID set can destroy the data on the set, the configuration code takes great pains to attempt to ensure that the components being configured really belong together. Using the standard configuration techniques, the component labels are used to ensure that the components are specified in the correct order in the configuration file. (The ordering of components in the configuration file will probably become irrelevant at some point, with component labels being used as the sole determinant of where a component belongs).
RAID sets can be layered. That is, a RAID 0 set consisting of three RAID 1 sets is perfectly acceptable.
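As a rough illustration of what layering means for a single request, the sketch below maps a logical block on a RAID 0 set onto one of several RAID 1 children, each of which then touches both of its components. The function names, the stripe unit of 32 blocks, and the device names are assumptions chosen for the example; RAIDframe's own layout code is not shown in this article.

```python
def raid0_map(block, stripe_unit, num_children):
    """Map a logical block to (child index, block within that child)."""
    su = block // stripe_unit                  # which stripe unit
    child = su % num_children                  # stripe units rotate over children
    child_block = (su // num_children) * stripe_unit + block % stripe_unit
    return child, child_block

def layered_write_targets(block, stripe_unit, mirrors):
    """For RAID 0 over RAID 1: a write goes to both halves of one mirror."""
    child, child_block = raid0_map(block, stripe_unit, len(mirrors))
    return [(disk, child_block) for disk in mirrors[child]]

# Three hypothetical RAID 1 sets, each made of two components.
mirrors = [("sd0e", "sd1e"), ("sd2e", "sd3e"), ("sd4e", "sd5e")]
print(layered_write_targets(70, 32, mirrors))
print(layered_write_targets(100, 32, mirrors))
```

The upper layer only needs to know how many children it has; whether a child is a disk partition or another RAID set makes no difference to the mapping.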
The ability to deal gracefully with component failures is probably the most important feature of any RAID system. The second most important feature is the ability to reconstruct the lost component(s) so that redundant operation can be resumed.
RAIDframe handles component failures quite gracefully. A number of different drives died during stress-testing of the RAID subsystem, and the tests continued without a hitch, albeit in degraded mode. The author's main system suffered a disk failure on a RAID 5 set, and it was a week before the author even realized that a disk had failed! (Yes, the author intends to improve the administrator notification when a component fails :) ) More recently, one of two disks in the author's RAID 1 set experienced a number of read errors, and again the system continued to function properly. The ailing disk has since been replaced (it was less than 3 days old when it started having problems) and the new disks have been functioning perfectly (36 days of uptime).
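Why a RAID 4 or RAID 5 set can keep running with one component missing, and later rebuild it, comes down to XOR parity. The following is a minimal sketch of that general technique with tiny made-up blocks; the function names are illustrative and this is not RAIDframe's implementation.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def reconstruct(stripe, failed):
    """Rebuild the block at index `failed` from all surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != failed]
    return xor_blocks(survivors)

stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
lost = 1                                  # pretend this component has failed
print(reconstruct(stripe, lost) == stripe[lost])
```

A read in degraded mode performs the same computation on the fly, and a reconstruction simply applies it to every stripe and writes the result to the replacement component.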
While there is nothing in RAIDframe that requires the use of hot-swappable drives, the availability of such drives further reduces the need to take a machine off-line in the event of a drive failure. To simulate a hot-swappable drive, the author uses an external drive with its own power switch, which allows easy ``failing'' of a drive at the hardware level. A typical test procedure is often:
1. Start doing heavy IO on a fully functioning RAID 5 set.
2. ``Fail'' the external disk (say sd3) by turning it off. The system continues to run in degraded mode.
3. Turn the external disk back on. Since the disk is marked as ``failed'' by RAIDframe, it will not be accessed. Turning the disk back on simulates the hot-adding of a new drive.
4. Use:
       scsictl scsibus0 scan any any
   to tell NetBSD to look for ``new'' drives on the first SCSI bus.
5. At that point the ``new'' disk is ready for disklabelling, partitioning, etc.
6. Do an in-place reconstruction of the failed drive:
       raidctl -R /dev/sd3e raid0
   (assuming sd3e is the component on the failed drive, on RAID set raid0).
7. Once the reconstruct finishes, the RAID set is back in its former state, all without having taken the machine down.
IO can be taking place on the RAID set during the entire operation, which means that with hot-swappable disks the machine should never need to go down due to disk failure.
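For anyone who repeats this test often, the two commands from the procedure can be generated by a small helper. The sketch below only builds and prints the command lines (a dry run); it does not execute anything, and the function name and its defaults are assumptions taken from the example above rather than part of any NetBSD tool.

```python
def hot_replace_commands(bus="scsibus0", component="/dev/sd3e", raid_set="raid0"):
    """Return the rescan and in-place reconstruction commands as argv lists."""
    return [
        # Ask NetBSD to look for new drives on the given SCSI bus.
        ["scsictl", bus, "scan", "any", "any"],
        # (The replacement disk must be disklabelled and partitioned here.)
        # Rebuild the failed component in place.
        ["raidctl", "-R", component, raid_set],
    ]

for argv in hot_replace_commands():
    print(" ".join(argv))
```

Keeping the commands as argument lists makes it straightforward to hand them to `subprocess.run` later, once the disklabel step in between has been dealt with by hand.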
It was slow, and the kernel had to be gutted of other important things (like networking!) to make enough room, but the Sun 3/50 *did* run RAID 5 over 3 SCSI disks. The HP380 with 5 SCSI disks on one RAID 5 set and 4 HP-IB disks on another RAID 5 set doesn't exactly scream either, but it does work, and it does give a small (by today's standards) amount of redundant storage for that class of machine:
Filesystem  1K-blocks    Used    Avail Capacity  Mounted on
/dev/sd1a       23663   19288     3191    85%    /
/dev/sd1d      248430   99979   136029    42%    /usr
/dev/raid0d   2507757  483365  1899004    20%    /mnt
/dev/raid1d   1900526  242974  1562525    13%    /mnt2
The i386 architecture has been used for most of the RAID testing. Machines ranging from 486DX50s through P133s to AMD K6@233s and K6-2@350s have been used in testing. Pmax, sparc, and alpha architectures have also been tested.
An often-cited disk performance benchmark these days is Bonnie. Performance values for RAID 0 and RAID 1 sets (2GB test size) are given in Table 1. The system (at time of print) is an AMD K6-2@350, with an AdvanSys ASB3940U2W-00 SCSI controller. The disks used in the tests are Fujitsu MAE3182 18.2GB U2W drives. NetBSD-1.4.1 and NetBSD-current are the OSes used (performance under -current is slightly better).
Table 1
         -------Sequential Output-------- ---Sequential Input-- --Random--
         -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
         K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
RAID-0   11819 87.5 24502 55.0  4676 16.5 10697 94.9 30217 89.4  89.8  5.7
RAID-1   10088 79.6 14948 32.7  4648 18.3 11051 95.3 19874 42.6  85.0  5.7
Note that the RAID-1 performance on reads is somewhat under-estimated by this benchmark, since RAIDframe will do load-balancing across the drives, and thus achieve closer to RAID-0 (on two drives) performance when there are two or more reads taking place.
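The effect can be seen with a toy model of read dispatch on a two-way mirror. The article does not say which policy RAIDframe uses to balance reads, so the shortest-queue rule below is purely an assumption for illustration, as are the function names.

```python
def pick_mirror(queue_depths):
    """Pick the mirror with the fewest outstanding reads (ties: lowest index)."""
    return min(range(len(queue_depths)), key=lambda i: queue_depths[i])

def dispatch_reads(num_reads, num_mirrors=2):
    """Queue `num_reads` concurrent reads and return the per-mirror queue depths."""
    depths = [0] * num_mirrors
    for _ in range(num_reads):
        depths[pick_mirror(depths)] += 1
    return depths

print(dispatch_reads(1))   # a single sequential reader keeps only one disk busy
print(dispatch_reads(4))   # concurrent readers are spread over both disks
```

With a single outstanding read, as in a one-process Bonnie run, only one disk is ever busy, which is why the benchmark cannot show the benefit.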
Table 2 shows the performance of various RAID/ccd configurations on the same machine, but with different controllers (two ASUS SC875s) and different drives (Seagate ST32155W Fast/Wide). RAID-0a shows the performance of 5 of these drives in a RAID-0 configuration. RAID-0b shows the performance of 4 of these drives in a RAID-0 configuration. RAID-5 has the same storage capacity as RAID-0b, but uses 5 disks instead of 4. The ccd entry shows that, for all its additional complexity, the RAIDframe driver is still on par with the ccd driver when it comes to simple striping.
Table 2
         -------Sequential Output-------- ---Sequential Input-- --Random--
         -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
         K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
RAID-0a  11534 95.0 20724 48.1  3571  9.7  9317 92.5 22349 56.4  33.8  2.0
ccd      13035 96.2 17690 34.1  3130  6.7 10594 92.8 18880 43.9  33.5  1.3
RAID-0b  11991 92.9 16924 36.9  3110  8.5 10025 92.0 18904 53.5  33.3  2.1
RAID-5    3069 22.1  3039  5.4  2052  7.5 10234 88.7 16174 49.0  24.4  1.9
These drives are fairly slow by today's standards, but as is seen here, even slow disks can be turned into a reasonably fast RAID set.
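The capacity comparison made for Table 2 (RAID-5 on 5 disks versus RAID-0b on 4) follows from RAID 5 giving up one component's worth of space to parity. The short sketch below works through that arithmetic and also computes, from the block figures in Table 2, how RAID-5 throughput compares with RAID-0b. The helper name is made up for the example, and capacity is counted in whole components rather than in bytes.

```python
def usable_components(level, disks):
    """Number of components' worth of usable space for a RAID set."""
    if level == 0:
        return disks            # pure striping, no redundancy
    if level in (4, 5):
        return disks - 1        # one component's worth holds parity
    raise ValueError("level not covered by this sketch")

# Same usable capacity: RAID 5 on 5 disks vs. RAID 0 on 4 disks.
print(usable_components(5, 5), usable_components(0, 4))

# Block throughput in K/sec, copied from Table 2.
raid0b = {"write": 16924, "read": 18904}
raid5 = {"write": 3039, "read": 16174}

write_ratio = raid5["write"] / raid0b["write"]
read_ratio = raid5["read"] / raid0b["read"]
print(f"RAID-5 block writes: {write_ratio:.2f} of RAID-0b")
print(f"RAID-5 block reads:  {read_ratio:.2f} of RAID-0b")
```

On these numbers the cost of parity shows up almost entirely on writes, while block reads stay close to what plain striping delivers.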
RAID is no longer only an option for expensive and high-end servers. RAIDframe, which started as a simulation framework, is now available for general use on a wide variety of architectures in NetBSD and OpenBSD. RAIDframe, as a fully-integrated kernel device driver, provides a robust and reliable RAID system. Real-life device failures have provided the additional proof that RAIDframe has moved well beyond the simulation stage -- it also works very well in practice.