RAID's Final Countdown
The long-running data storage technology may be headed for trouble. We examine the problem and various potential solutions.
The concept of parity-based RAID (levels 3, 5 and 6) is now quite old in technological terms. Its limitations will become clear in the not-too-distant future, and they are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band-Aid for RAID-5: going from one parity drive to two simply delays the inevitable.
The bottom line is this: disk density has increased far more than performance, while hard error rates haven't changed much. The result is much longer RAID rebuild times and a much higher risk of data loss. In short, it's a scenario that will eventually require a solution, if not a whole new way of storing and protecting data.
We'll start with a short history of RAID, or at least the last 15 years of it, and then discuss the problems in greater detail and offer some possible solutions.
Some of the first RAID systems I worked on, back in 1994, used RAID-5 and 4 GB drives that ran at a peak of 9 MB/sec. These were, of course, not the first RAID systems, but 1994 is a good baseline year. The table below shows how the RAID rebuild picture has changed over the last 15 years.
A few caveats and notes on the data:
- Except for 1998, the number of drives needed to saturate a channel has increased over time. Some vendors did have 1 Gb Fibre Channel in 1998, but most did not.
- Rebuilding at 90 percent of full rate is often a best-case number, and in many cases a full rebuild must process the entire drive.
- The bandwidth figures assume that two channels of the fastest type available can be used for the 9 drives of a RAID-5 group or the 10 drives of a RAID-6 group. So for 2009, I am assuming two 800 MB/sec channels are available for rebuilding the 9 or 10 drives.
- The time to read an entire drive keeps increasing, because density is growing faster than performance.
- Changes in the number of drives needed to saturate a channel, along with new technologies such as SSDs and pNFS, are going to affect channel performance and cause RAID vendors to rethink the back-end design of their controllers.
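The assumptions in these notes can be turned into a rough per-drive rebuild model. The Python sketch below is one way to do that, not the method used to build the table. It assumes each drive rebuilds at 90 percent of its peak rate unless the drives in the group together need more than the channels can carry, in which case the channel bandwidth is split evenly among the drives. The function names and the 120 MB/sec drive rate in the usage lines are hypothetical; only the two 800 MB/sec channels and the 10-drive RAID-6 group come from the notes.

```python
import math


def drives_to_saturate(channel_mb_s: float, drive_mb_s: float) -> int:
    """Smallest number of streaming drives whose combined rate fills one channel."""
    return math.ceil(channel_mb_s / drive_mb_s)


def per_drive_rebuild_rate(n_drives: int, drive_mb_s: float,
                           channels: int, channel_mb_s: float,
                           fraction: float = 0.9) -> float:
    """MB/sec each drive can sustain during a rebuild (simplified model).

    Each drive is assumed to stream at `fraction` of its peak rate, unless the
    group's combined demand exceeds the total channel bandwidth; then the
    channels are assumed to be shared equally among all drives in the group.
    """
    drive_limited = fraction * drive_mb_s
    channel_limited = channels * channel_mb_s / n_drives
    return min(drive_limited, channel_limited)


# Hypothetical 2009-era drive streaming at 120 MB/sec (not a figure from the table).
print(drives_to_saturate(800, 120))             # drives needed to fill one 800 MB/sec channel
print(per_drive_rebuild_rate(10, 120, 2, 800))  # 10-drive RAID-6 group on two channels
```

With these hypothetical numbers the rebuild is drive-limited (about 108 MB/sec per drive), because two 800 MB/sec channels give each of the 10 drives 160 MB/sec of headroom. A faster drive would flip the result to channel-limited, which is the back-end design pressure the last note describes.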
We all know that density is growing faster than bandwidth. A good rule of thumb is that each generation of drives improves bandwidth by about 20 percent. The problem is that density has been growing far faster for years. While density growth may now be slowing from 100 percent per generation to 50 percent or less, drive performance improvement is holding fairly steady at about 20 percent per generation.
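Because the time to read a full drive is capacity divided by bandwidth, these growth rates compound: each generation multiplies read time by (1 + density growth) / (1 + bandwidth growth). The minimal Python sketch below illustrates that arithmetic under two assumptions of ours: capacity grows at the same rate as density, and the 20 percent bandwidth rule of thumb holds every generation. The function name is hypothetical.

```python
def read_time_growth(generations: int, density_growth: float,
                     bandwidth_growth: float = 0.20) -> float:
    """Factor by which full-drive read time grows after `generations`.

    Assumes capacity tracks density and both rates stay constant per generation.
    """
    per_generation = (1 + density_growth) / (1 + bandwidth_growth)
    return per_generation ** generations


for density in (1.00, 0.50):
    factors = [read_time_growth(g, density) for g in range(1, 4)]
    print(f"density +{density:.0%}/gen:", [round(f, 2) for f in factors])
```

At 100 percent density growth, read time rises by roughly 1.67x per generation; even at 50 percent it still rises 1.25x per generation, so slower density growth narrows the gap but does not close it.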
Using the sample data in the table above, RAID rebuild times have increased more than 300 percent over the last 15 years. If we change the formula from 90 percent of disk bandwidth to 10 percent, which might be the case if the device is heavily loaded with application I/O (thanks in part to the growing use of server virtualization), then rebuild times get ugly pretty quickly, as the RAID rebuild table below shows.
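Dropping from 90 percent to 10 percent of disk bandwidth makes any rebuild nine times longer, since rebuild time is inversely proportional to the bandwidth left over for it. The Python sketch below shows this with the 1994 baseline drive from the text (4 GB at 9 MB/sec). It is a simplification, assuming the rebuild is limited only by the replacement drive's available bandwidth, must cover the whole drive, and uses decimal units (1 GB = 1000 MB); the function name is hypothetical.

```python
def rebuild_hours(capacity_gb: float, drive_mb_s: float, fraction: float) -> float:
    """Hours to rebuild one whole drive when only `fraction` of its bandwidth is free.

    Assumes decimal units (1 GB = 1000 MB) and that nothing but the drive's
    available bandwidth limits the rebuild.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    seconds = capacity_gb * 1000 / (drive_mb_s * fraction)
    return seconds / 3600


# 1994 baseline from the text: a 4 GB drive with a 9 MB/sec peak rate.
for frac in (0.9, 0.1):
    minutes = rebuild_hours(4, 9, frac) * 60
    print(f"{frac:.0%} of bandwidth: {minutes:.1f} minutes")
```

For the 1994 drive this prints about 8.2 minutes at 90 percent and 74.1 minutes at 10 percent. The same ninefold factor applies to any drive, so a multi-hour rebuild at 90 percent becomes a multi-day one at 10 percent.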