If you’ve been following the server market and Sun in particular, you may be wondering why all the fuss over Sun’s latest processor in the T1000 and T2000 servers. The servers are said to outperform anything on the market, and if you’re anything like us, you’re wondering how (and even if) this is possible.
Sun Microsystems’ new T1 processor has been described as both innovative and game-changing. More important, however, is how the chip fares in a real-world server configuration. We take the T2000 out for a spin to find out.
When Sun sent us some reading material and a loaner T2000 to play with, we were eager to see for ourselves.
The review model we tested was a six-core T2000. The servers come in four-, six-, and eight-core single-processor options. Other palate-whetting features include:
- UltraSPARC T1 processor, up to eight cores, 3MB L2 cache (shared)
- 16 DDR2 (ECC/registered) slots, scalable to 32GB maximum
- Redundant hot-swap power supplies and fans
- Up to four 10K SAS hard drives, hot-swappable
- Hardware RAID 1 (mirroring)
- Four gigabit network interfaces
- Three PCI-E slots
- Two 64-bit PCI-X 133MHz slots
- Advanced Lights Out Manager, for remote administration
That’s one serious server, with a host of innovative features due largely to the new T1 processor.
Chip Multi-Threading (CMT) technology lets the processor manage threads in hardware. You’ll notice that the T1 processor has up to eight cores, but Sun persistently talks about 32 threads. This is because each core keeps four threads ready to execute. The threads share the same pipeline, and because the hardware schedules and switches among them, moving from one thread to another is a single, fast operation. The operating system actually sees 32 processors and schedules threads on all of them. Once a core has been handed multiple threads, the hardware runs them on that core transparently, without intervention from the software scheduler. This processor is the first of its kind; however, we don’t expect it to remain competition-free for long.
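The thread arithmetic above can be sketched in a few lines. This is illustrative only: the function name is hypothetical, and the four-threads-per-core figure is taken from the description above rather than queried from any hardware.

```python
# Illustrative arithmetic only; logical_processors is a hypothetical helper.
THREADS_PER_CORE = 4  # per the text: four hardware threads per T1 core


def logical_processors(cores, threads_per_core=THREADS_PER_CORE):
    """Return how many processors the operating system would see."""
    if cores < 1 or threads_per_core < 1:
        raise ValueError("cores and threads_per_core must be positive")
    return cores * threads_per_core


if __name__ == "__main__":
    # The T2000 ships in four-, six-, and eight-core configurations.
    for cores in (4, 6, 8):
        print(f"{cores} cores -> {logical_processors(cores)} logical processors")
```

For the six-core review unit this works out to 24 logical processors, and for the full eight-core part, the 32 that Sun advertises.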
What this processor design translates to in real-world applications is that a server will hum along just as happily running one thread as it will running 32. The implications here may be a bit difficult to understand. This doesn’t mean you really have a 9 GHz processor; it also doesn’t mean any single transaction will run faster. In fact, the T1 has a relatively low clock speed compared to the power-intensive and hotter Opteron and Xeon processors. When most of a process’ time is spent waiting on memory anyway, it makes sense that each core of the CPU have something to do rather than sit idle.
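A toy model may help make the memory-wait argument concrete. It assumes every thread alternates between a fixed number of compute cycles and a fixed number of memory-stall cycles, and that a core can run exactly one thread per cycle. The cycle counts below are hypothetical values chosen for illustration, not T1 measurements, and the function names are our own.

```python
# Toy model of one core shared by several hardware threads.
# All cycle counts are hypothetical; nothing here is measured on a T1.


def core_utilization(hw_threads, compute_cycles, stall_cycles):
    """Fraction of cycles in which the core has a thread ready to run."""
    if hw_threads < 1 or compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("invalid model parameters")
    # Each thread wants the pipeline for compute/(compute+stall) of the time.
    demand = hw_threads * compute_cycles / (compute_cycles + stall_cycles)
    return min(1.0, demand)  # a core cannot be more than 100% busy


def per_thread_rate(hw_threads, compute_cycles, stall_cycles):
    """Share of the core's cycles that each individual thread receives."""
    return core_utilization(hw_threads, compute_cycles, stall_cycles) / hw_threads


if __name__ == "__main__":
    COMPUTE, STALL = 1, 3  # assumed: 1 cycle of work, then 3 cycles waiting
    for n in (1, 2, 4):
        util = core_utilization(n, COMPUTE, STALL)
        rate = per_thread_rate(n, COMPUTE, STALL)
        print(f"{n} thread(s): core busy {util:.0%}, each thread gets {rate:.0%}")
```

Under these assumed numbers, adding threads raises how busy the core is while leaving each thread's own progress unchanged, which is the throughput-versus-single-transaction trade-off described above.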
And this is how the T1 outperforms everyone else in high-utilization situations. Unfortunately, low clock speed means slower operations. There’s no getting around that.
Sun decided to go with a slower clock (1.0 to 1.2 GHz) for the T1 chips for two reasons. First, there’s a huge market for database and Web servers that must handle thousands of transactions per second. The T1 processor is ideal for these situations, regardless of its clock speed. Second, Sun has an affinity for the power craze. Sun even created its own metric to help organizations assess how efficiently its servers are operating.
The SWaP (Space, Watts, and Performance) metric rates a server’s usefulness by weighing its performance against the rack space it occupies and the watts of power it consumes. Large data centers are constantly struggling with power, which makes this a wise move on Sun’s part. The T1 processor is much more than just a power-saving processor, though.
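As a rough sketch of how such a figure is computed, Sun's published definition divides performance by the product of rack space and power draw. The function name and every number below are hypothetical placeholders, not measurements of the T2000 or any other server.

```python
# SWaP sketch: performance / (space * watts).
# Function name and sample figures are hypothetical, not benchmark data.


def swap_score(performance, rack_units, watts):
    """Return a SWaP-style score; higher means more work per unit of space and power."""
    if performance < 0 or rack_units <= 0 or watts <= 0:
        raise ValueError("performance must be >= 0; rack_units and watts must be > 0")
    return performance / (rack_units * watts)


if __name__ == "__main__":
    # Two made-up servers: B is faster, but larger and hungrier.
    server_a = swap_score(performance=1000, rack_units=2, watts=300)
    server_b = swap_score(performance=1200, rack_units=4, watts=600)
    print(f"server A: {server_a:.2f}")
    print(f"server B: {server_b:.2f}")
```

In this made-up comparison the slower server scores higher, because it delivers its performance in less space and with less power.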
Sun’s claims of crypto acceleration on-chip left us yearning, at first. There is one Modular Arithmetic Unit (MAU) per core, and it provides multiplication and exponentiation assistance for SSL processing. The actual FPU (Floating Point Unit) is shared between all cores. Although the MAU at first appeared to be a Band-Aid to alleviate the shared FPU issue, it has since proven itself more than worthy. At first, we ran our own standard openssl build to speed-test signing with 1024-bit DSA keys:
> openssl speed dsa1024
Eh. We were unimpressed. A modest (1.8GHz) Opteron box can perform much better:
sign        verify      sign/s    verify/s
0.000535s   0.000609s   1867.6    1642.9
We were not amused. The T2000 is supposed to have cryptographic acceleration, after all. Indeed it does, but the application must understand how to use it. The openssl from /usr/sfw/bin understands:
> /usr/sfw/bin/openssl speed -engine pkcs11 dsa1024
engine "pkcs11" set.
sign      verify    sign/s     verify/s
0.0001s   0.0001s   15483.0    13818.4
The MAU clearly does its job. We wouldn’t recommend using any SSL applications that can’t take advantage of a crypto engine, but as you can see, the order-of-magnitude difference makes the T2000 capable of serving the busiest SSL-based services. The fastest Opteron we had on hand wasn’t able to touch the T2000. The Opteron 275 (2.2GHz dual-core) signed only about 2,750 keys in 10 seconds, compared to the T2000’s 15,500.
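To put a number on the gap between the two openssl speed results quoted above, the short sketch below parses the two result lines and computes the ratios. The parsing helper is our own and assumes the last two whitespace-separated fields of a result line are the sign/s and verify/s rates, as in the output shown in this article.

```python
# Compare the two "openssl speed dsa1024" result lines quoted in this article.
# parse_rates is a hypothetical helper, not part of openssl.


def parse_rates(result_line):
    """Return (sign_per_sec, verify_per_sec) from an openssl speed result line."""
    fields = result_line.split()
    if len(fields) < 2:
        raise ValueError("expected at least two fields in the result line")
    # Assumption: the final two fields are the sign/s and verify/s columns.
    return float(fields[-2]), float(fields[-1])


STANDARD_RUN = "0.000535s 0.000609s 1867.6 1642.9"  # first run, no engine
PKCS11_RUN = "0.0001s 0.0001s 15483.0 13818.4"      # run with -engine pkcs11

base_sign, base_verify = parse_rates(STANDARD_RUN)
fast_sign, fast_verify = parse_rates(PKCS11_RUN)

sign_speedup = fast_sign / base_sign
verify_speedup = fast_verify / base_verify

if __name__ == "__main__":
    print(f"sign speedup:   {sign_speedup:.1f}x")
    print(f"verify speedup: {verify_speedup:.1f}x")
```

On the figures quoted here, the pkcs11 run comes out a little over eight times faster for both signing and verifying.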
Single-thread performance is admittedly slower, whereas high-throughput computing is exceedingly impressive. The CPU design’s main drawback is that when running a single thread on an idle core, the thread still doesn’t have exclusive access to the core’s resources. Coupled with a simpler pipeline, this makes lightweight (single-threaded) applications suffer. For example, 101.3 keys signed in 10 seconds is frightening; a Pentium III 550MHz outperforms that by 150 percent. Sun’s documentation doesn’t bluff its way around this, but the information isn’t easy to find, either. We find it prudent to again stress that a T2000 server should be considered only for high-utilization applications.
A listing of benchmarks and impressive data can be found on Sun’s Web site. We found it more important to talk about the reasoning behind this great performance, but we also ran a few tests just to make sure everyone else was telling the truth. For example, we tasked a dual-processor V240 server with building the GNU C compiler in parallel. Giving it 30 processes (make -j30), as expected, brought the machine to its knees. On the T2000, however, ‘mpstat’ came to life, and ‘prstat’ showed ‘cc’ processes on each of the 24 processors. The really interesting part is that although the load average was more than 25, you’d never know it: The machine was as responsive as always. Five minutes later we had gcc-4.1 built and ready to use. The T2000 is truly an amazing machine.
What do you want to task the shiny new T2000 with? Tons, and that’s the design goal. Lots of busy applications running many parallel threads of execution is the T2000’s niche. It isn’t a niche market, however. Most application, database, and Web servers don’t need to perform much floating-point math, so the T2000 is well-suited for the most important roles in your data center.