A DIY Guide to Resolving Tape-RAID Skirmishes
Most of the time, tape and RAID have a relationship akin to peanut butter and jelly or bagels and lox. What happens when they don't? Here's how one organization found the answer. Not long ago, I was working at a site where the customer was complaining about tape performance and tape-archive reliability. The problems showed up in several ways: backups ran very slowly and threw many errors, and older backup tapes were error-prone.
For a site that needed high reliability, fast restoration and long-term access to data, this was a bad situation. From a business continuity perspective, the status quo was not going to fly with management or the users.
So began a search for the root of the problem. If reading or writing to tape was slow, then what was the cause?
Analyzing Tape Performance
First, the sar command was used to look at tape performance and understand how bad things really were. The output from sar made it clear that the tape drive was running poorly, based on the reported transfer rate to the drive. When doing this type of work, you should know the performance of the tape drive for both compressed and uncompressed data. At the bottom of this article, you'll find a table displaying some common tape drives and their performance, according to vendor Web sites.
Determine the expected performance of a tape drive, including the expected compression performance, as part of the performance analysis process. Compression varies with the type of data, but enterprise drives from IBM and Sun/STK generally will provide better compression than LTO drives, given the compression chipset being used. For more information, see Back to the Future with Tape Drives on Enterprise Storage Forum.
Operating System and Application
Familiarity with the tape drive type yielded a quick estimate of its compression using gzip and a random sampling of files. As expected, the tape drive data rate was running at less than 30 percent of what it should have been. Since tape drives write only the data they are given, it was time to check the connection to the tape drive, the operating system settings, and the RAID configuration and settings.
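A rough version of that compression check can be scripted. This is a sketch, not the tool used at the site: it gzip-compresses the first chunk of a random sample of files as a software stand-in for the drive's hardware compression (the sample size and chunk size here are arbitrary choices):

```python
import gzip
import random

def estimate_compression_ratio(paths, sample_size=20, chunk=1 << 20):
    """Gzip-compress the first `chunk` bytes of a random sample of
    files and return raw/compressed as a rough compression estimate."""
    sample = random.sample(paths, min(sample_size, len(paths)))
    raw = compressed = 0
    for path in sample:
        with open(path, "rb") as f:
            data = f.read(chunk)
        raw += len(data)
        compressed += len(gzip.compress(data))
    return raw / compressed if compressed else 1.0
```

If the drive's rated compressed speed assumes a 2:1 ratio your data cannot reach, the "expected" rate should be adjusted down before blaming the hardware.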
The tape HBA settings were set correctly: HBA was set for tape emulation and a large command queue, and no errors were being received on the HBA.
The next step was to look at the operating system configuration and information for the application writing the tapes. Here, several problems were found.
The operating system was not tuned to allow requests greater than 128 KB to be read or written. Since the tape block size is 256 KB, this was causing multiple I/O requests for a single tape block.
The application writing/reading the tape drive had only four readahead/writebehind buffers. Given the latency from the RAID and to the tape drive, this could be a serious problem, but bigger problems lurked.
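The effect of these two findings can be sketched with some quick arithmetic. The 20 ms round-trip latency below is a hypothetical figure for illustration, not a measurement from the site:

```python
import math

TAPE_BLOCK = 256 * 1024      # bytes per tape block
MAX_OS_IO = 128 * 1024       # untuned OS maximum request size

# Every 256 KB tape block is split into two separate OS requests:
ios_per_block = math.ceil(TAPE_BLOCK / MAX_OS_IO)

def max_streaming_rate(buffers, block_bytes, round_trip_s):
    """Upper bound on throughput with `buffers` blocks in flight:
    the application can complete at most that many blocks per
    round trip, however fast the drive itself is."""
    return buffers * block_bytes / round_trip_s

# With only four buffers and a hypothetical 20 ms RAID-to-tape
# round trip, the pipeline tops out regardless of drive speed:
cap = max_streaming_rate(4, TAPE_BLOCK, 0.020)   # bytes/sec
print(ios_per_block, cap / 1e6)
```

More readahead/writebehind buffers raise that ceiling linearly, which is why four buffers is a thin margin when latency is nontrivial.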
When the RAID was examined, the pieces of the puzzle fell into place. Although I often recommend against doing what this customer did, it does work in some environments. The customer had configured a RAID-5 5+1 LUN with 64 KB segments, so a full stripe of data was 320 KB, while each read or write from the tape device was 256 KB. Data was being read and written at a block size the RAID was not laid out for, creating significant extra work for the kernel. Since the data rate of the tape drive with compression was nearly the rate of the 2 Gb connection to the RAID, the problem was clear: this non-power-of-two LUN configuration was a serious mismatch and the main cause of the performance problem. Take the following example:
| |Disk 1|Disk 2|Disk 3|Disk 4|Disk 5|Disk 6|
|320 KB Stripe 1|Parity|Block 1|Block 1|Block 1|Block 1|Block 2|
|320 KB Stripe 2|Block 2|Parity|Block 2|Block 2|Block 3|Block 3|
|320 KB Stripe 3|Block 3|Block 3|Parity|Block 4|Block 4|Block 4|
Each cell is a 64 KB segment; each 256 KB tape read/write block spans four data segments, and Block 4 continues into stripe 4.
Clearly the system was not reading or writing full stripes of data from the RAID device, and after the first I/O, nearly every request straddles two stripes and requires a head seek. Only the occasional block that happens to start or end on a stripe boundary avoids the extra seek, and performance cannot be good when almost every I/O requires touching two stripes. Even a larger segment size, say 256 KB per disk, might improve performance for this type of configuration, but it is still not optimal.
One of the reasons this will become a bigger issue in the future is that tape block sizes will increase over time, so even if you reduce the impact of the problem today by using larger per-disk allocations, the fix is unlikely to last. Using the above example with, say, a 256 KB per-disk allocation creates a stripe size of 1280 KB; if tape block sizes move to the range of 1 MB or greater, you'll have the same problem all over again.
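The stripe-crossing arithmetic behind both examples can be checked with a short sketch (the sizes come from the text; the function itself is illustrative, not from any RAID tool):

```python
def stripes_touched(block_bytes, offset_bytes, segment_bytes, data_disks):
    """How many RAID stripes a single I/O touches, given where it
    starts relative to the beginning of a stripe."""
    stripe = segment_bytes * data_disks
    start = offset_bytes % stripe
    return (start + block_bytes + stripe - 1) // stripe

KB = 1024
# 5+1 RAID-5, 64 KB segments (320 KB stripe) vs. 256 KB tape blocks:
now = [stripes_touched(256 * KB, i * 256 * KB, 64 * KB, 5)
       for i in range(5)]
# 256 KB segments (1280 KB stripe) vs. future 1 MB tape blocks:
later = [stripes_touched(1024 * KB, i * 1024 * KB, 256 * KB, 5)
         for i in range(5)]
print(now, later)   # the same misalignment pattern at both scales
```

In each five-block cycle, three blocks straddle two stripes; only the blocks that begin or end exactly on a stripe boundary stay within one, and quadrupling the segment size merely reproduces the same pattern at a larger scale.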
The Power of Two Problem
Why do vendors and owners set up RAID devices with non-power-of-two allocations? (Power of two here refers to the number of data drives; e.g., 8+1 RAID-5, with eight data drives, is a power of two.) Three common reasons are:
- From the vendor side, a number of vendors do not support power-of-two data allocations for their RAID-5 devices. This ignores many of the applications out there: database index files are often powers of two, database table sizes are often powers of two, reads by Web servers are very often powers of two, and C library buffered I/O (fwrite/fread) allocations are powers of two, as are many other applications'. Readahead, although helpful, assumes that the file system allocated the data sequentially, which often is not true.
- RAID owners are concerned with write reconstruction time and often set up the configuration around the time it takes to reconstruct a LUN.
- RAID owners often set up devices based on drive count and the number of drives "wasted" on parity. If you buy 10.2 TB as 34 300 GB drives and set them up as 4+1 RAID-5 LUNs, you get six LUNs for a total of 7.2 TB of data space. Setting up the same drives as 9+1 RAID-5 would give you 8.1 TB of data space, while RAID-5 16+1 would give you 9.6 TB (but no hot spares). This is one of the main reasons for such odd combinations.
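The drive-count trade-off in that last point is easy to tabulate. A minimal sketch, using the 34-drive, 300 GB example from the text:

```python
def raid5_usable(total_drives, drive_tb, data_per_lun):
    """LUN count and usable capacity when carving the drives into
    RAID-5 LUNs of `data_per_lun` data drives plus one parity drive."""
    luns = total_drives // (data_per_lun + 1)
    return luns, luns * data_per_lun * drive_tb

# 34 x 300 GB drives = 10.2 TB raw:
for data_drives in (4, 9, 16):
    luns, tb = raid5_usable(34, 0.3, data_drives)
    print(f"{data_drives}+1: {luns} LUNs, {tb:.1f} TB usable")
```

The wider the LUN, the less capacity parity consumes, which is exactly the pressure that pushes owners toward odd, non-power-of-two configurations.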
Powers of two are important for tape performance for RAID configurations, but powers of two are also important for many other application types. In the high performance computing world, many applications require powers of two for their allocation in memory and in many cases for their allocation of CPU counts, and the same is true for I/O to storage for algorithms, such as FFTs.
That people use non-powers of two, or even worse, prime numbers, for RAID configurations is a sign that a great deal more education is needed on data movement issues. The problem is that everyone looks at their own hardware and software design and development in a vacuum. This is true not just for hardware and software developers, but for the system architects and designers who configure systems.
All in all, there are worse things in the world than a poorly configured RAID and tape system, but if your data is really important, it is critical to think about the architecture globally.
|Drive||Native Performance in MB/sec||Compressed Performance in MB/sec|
|Quantum DLT 600A||36||72|
|Sun/STK T10000||120||360 (claimed with future 4 Gb interface support)|
This article was originally published on Enterprise Storage Forum.