
RAID’s Final Countdown


Looks like the long-running data storage technology may be headed for trouble. We examine the problem — and various potential solutions.

The concept of parity-based RAID (levels 3, 5 and 6) is now pretty old in technological terms, and the technology’s limitations will become pretty clear in the not-too-distant future — and are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band-Aid for RAID-5, and going from one parity drive to two simply delays the inevitable.

The bottom line is this: Disk density has increased far more than performance, and hard
error rates haven’t changed much, creating much longer RAID rebuild times and a much
higher risk of data loss. In short, it’s a scenario that will eventually require a
solution, if not a whole new way of storing and protecting data.

We’ll start with a short history of RAID, or at least the last 15 years of it, and
then discuss the problems in greater detail and offer some possible solutions.

Some of the first RAID systems I worked on used RAID-5 and 4GB drives. These drives
ran at a peak of 9 MB/sec. This was not, of course, the first RAID system, but 1994 is a
good baseline year. Click on the image below to see how the RAID rebuild
picture has changed over the last 15 years.

A few caveats and notes on the data:

  1. Except for 1998, the number of drives needed to saturate a channel has
    increased with each generation. Some vendors did have 1Gb Fibre Channel in 1998,
    but most did not.
  2. 90 percent of full rate is often a best-case number, and in many cases a
    full rebuild requires reading the whole drive.
  3. The bandwidth assumes that two channels of the fastest type are available to be
    used for the 9 or 10 (RAID-5 or RAID-6) drives. So for 2009, I am assuming two 800
    MB/sec channels are available for rebuild for the 9 or 10 drives.
  4. The time to read a drive is increasing, as the density increases exceed performance increases.
  5. Changes in the number of drives needed to saturate a channel – along with new
    technologies such as SSDs and pNFS – are going to affect channel performance and
    cause RAID vendors to rethink the back-end design of their controllers.
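To make caveat 3 concrete, here is a rough back-of-the-envelope sketch of how a rebuild time can be estimated from those assumptions. The two 800 MB/sec channels, the 10-drive group, and the 90 percent efficiency figure come from the caveats above; the 1 TB drive capacity is an illustrative assumption, not a number from the table.

```python
# Rough rebuild-time estimate under the 2009 assumptions above:
# two 800 MB/sec channels shared by a 10-drive RAID-6 group, with each
# drive streaming at 90 percent of its even share of channel bandwidth.

def rebuild_hours(capacity_mb, channels, channel_mb_s, drives, efficiency=0.9):
    """Hours to read one full drive when total channel bandwidth is
    split evenly across all drives in the RAID group."""
    per_drive_mb_s = efficiency * (channels * channel_mb_s) / drives
    return capacity_mb / per_drive_mb_s / 3600

hours = rebuild_hours(capacity_mb=1_000_000,  # ~1 TB drive (assumed)
                      channels=2, channel_mb_s=800, drives=10)
print(f"{hours:.1f} hours")  # prints "1.9 hours"
```

Each drive effectively gets 0.9 × 1600 / 10 = 144 MB/sec, so a full read of a 1 TB drive takes just under two hours even in this best case.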

We all know that density is growing faster than bandwidth. A good rule of thumb is
that each generation of drives will improve bandwidth by 20 percent. The problem is that
density is growing far faster and has been for years. While density percentages might be
slowing now from 100 percent to 50 percent or less, drive performance is pretty fixed at
about 20 percent improvement per generation.
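That gap compounds. A quick sketch, using the growth rates from the paragraph above (50 percent density growth versus 20 percent bandwidth growth per generation — both assumed round figures), shows the time to read a full drive stretching by about 25 percent every generation:

```python
# Time to read a full drive scales as capacity / bandwidth.
# If density grows 50% per generation but bandwidth only grows 20%,
# each generation's full-drive read time is 1.5 / 1.2 = 1.25x the last.
density_growth = 1.5    # 50% per generation (assumed, per the text)
bandwidth_growth = 1.2  # 20% per generation (assumed, per the text)

per_gen = density_growth / bandwidth_growth
print(f"per generation: {per_gen:.2f}x")          # prints "per generation: 1.25x"
print(f"after 5 generations: {per_gen**5:.1f}x")  # roughly 3x
```

Five generations of that compounding roughly triples the time to read a drive, which lines up with the 300-percent rebuild growth discussed below.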

Using the sample data in the table above, RAID rebuild times have increased more than 300
percent over the last 15 years. If we change the formula from 90 percent of disk bandwidth to
10 percent – which might be the case if the device is heavily loaded with
application I/O, thanks in
part to the growing use of server virtualization – then rebuild gets pretty ugly, as
the RAID rebuild table below shows.
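The arithmetic behind that is simple: rebuild time scales inversely with the bandwidth the rebuild actually gets, so dropping from 90 percent of disk bandwidth to 10 percent makes a rebuild take nine times as long. A minimal sketch, using an assumed 1 TB drive and an assumed 160 MB/sec of raw per-drive channel bandwidth:

```python
# Rebuild time is inversely proportional to the bandwidth the rebuild
# actually receives. Both figures below are illustrative assumptions.
capacity_mb = 1_000_000  # ~1 TB drive (assumed)
raw_mb_s = 160.0         # raw per-drive channel bandwidth (assumed)

idle_hours = capacity_mb / (0.9 * raw_mb_s) / 3600  # 90% of bandwidth
busy_hours = capacity_mb / (0.1 * raw_mb_s) / 3600  # 10% of bandwidth
print(f"idle: {idle_hours:.1f} h, busy: {busy_hours:.1f} h")
print(f"slowdown: {busy_hours / idle_hours:.0f}x")  # prints "slowdown: 9x"
```

Under these assumptions, a rebuild that takes under two hours on an idle array stretches to more than 17 hours on a busy one — and the array is exposed to a second failure the whole time.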


Read the rest of this article at Enterprise Storage Forum.
