More on supercomputers
Sitting at No. 10 in the supercomputer rankings is Red Sky at Sandia National Labs in Albuquerque, N.M. It is composed of Oracle Sun X6275 blades with a total of 42,440 Intel Xeon 5500 series processor cores and 64 TB of RAM, delivering 500 TFlops of processing power. This is supported by a bank of Oracle Sun storage. The cluster runs Red Hat Linux.
What happens when two supercomputers join together and operate as one? With Red Sky/Red Mesa, a collaborative effort involving Sandia National Labs, the National Renewable Energy Laboratory and Oracle/Sun, we’re about to find out.
“We are using Oracle’s Sun blades and Sun storage in a private cloud setting,” said John Zepper, senior manager of computer systems at Sandia National Laboratories.
This is actually two supercomputers operating as one. Known as Red Sky/Red Mesa, it is a collaborative effort involving Sandia, the National Renewable Energy Laboratory (NREL), and Oracle/Sun (NASDAQ: ORCL). Red Sky is a 325 TFlop system supported by the Red Mesa 180 TFlop system.
“Sun won the bid for the new supercomputer, and there are now almost 43,000 cores between both machines,” said Zepper.
The Oracle Sun X6275 blade uses the Intel Nehalem architecture, which was designed for compute-intensive applications in general and commercial high performance computing (HPC) environments and which includes Intel’s QuickPath technology. Sandia interconnects the blades with InfiniBand to provide higher bandwidth and lower latency.
Zepper explains why InfiniBand was chosen. In a conventional design, technicians must run a cable from every node to the main switch, and the result is typically far too many cables.
“InfiniBand has helped us reduce the volume of cabling significantly,” said Zepper.
Integrated InfiniBand QDR Host Channel Adapters (HCA) and Quad Data Rate Switched Network Express Modules (QNEM) are used to interconnect the blades housed in Oracle’s Sun Blade 6048 chassis.
“We had some issues with QNEM so Oracle worked with us to modify it to work optimally in our environment,” said Zepper.
The resulting switches, which Sandia and Oracle/Sun designed together, were used to build the first implementation of a 3-D torus interconnect topology using InfiniBand networking. The system is also believed to be the first InfiniBand-based system that uses optical interconnect cables exclusively.
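To illustrate what a 3-D torus topology means in practice, the following Python sketch computes a node's neighbors and the hop count between two nodes. It is only an illustration: the 4 x 4 x 4 dimensions and the function names are hypothetical, since the dimensions of the Red Sky torus are not given here.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six neighbors of `node` in a 3-D torus of size `dims`."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # Modulo arithmetic wraps each edge back around to the other side,
            # which is what distinguishes a torus from a plain 3-D mesh.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


def torus_hops(a: Coord, b: Coord, dims: Coord) -> int:
    """Minimum hop count, taking the shorter way around each ring."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


# Hypothetical 4 x 4 x 4 torus (an assumption, not Red Sky's real layout).
DIMS = (4, 4, 4)
corner_neighbors = torus_neighbors((0, 0, 0), DIMS)
hops = torus_hops((0, 0, 0), (3, 3, 3), DIMS)
```

Because of the wraparound links, the opposite corner of this example grid is three hops away rather than nine, and every node connects only to its six neighbors instead of to one central switch.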
Zepper reports that the organization removed all the hard drives from the X6275 blades. Booting them over InfiniBand, he said, allows the organization to dispense with its Ethernet infrastructure for Red Sky. That added up to savings of 20 percent on the cost of each blade.
“By booting off InfiniBand, we have seen an improvement in performance of four to five times over our older infrastructure.”
Power and Cooling
Zepper gives a dramatic example of how compute performance has accelerated while the footprint has shrunk. The older supercomputer had 17 racks. Sandia can now get that amount of juice in one rack of blades.
The downside, of course, is the amount of heat generated. Therefore, the lab redesigned its cooling setup to increase efficiency and reduce costs. Zepper describes it as the most energy efficient compute platform Sandia has deployed to date. On the power and cooling side, it comprises Emerson/Liebert XDP Units, APC Power Distribution Units (PDUs), and a Cooligy Glacier Door for the rack.
“The enclosure door uses refrigerant and cools the blades, not the room,” said Zepper. “This saves $100,000 a year on power alone.” The Liebert XDP units at the room’s perimeter are used to keep the refrigerant cool and allow the Lab to load up to 35 kW per rack.
This direct-cooled system consumes 0.13 kW for every kW of cooling it delivers. Zepper reports that this cooling process reduced the chiller plant’s load, measured in tons of cooling, by 37 percent, water consumption by 5.4 million gallons per year, and chiller energy consumption by 77 percent.
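As a rough worked example of the 0.13 kW per kW figure, the Python sketch below estimates the cooling power for one fully loaded rack. It assumes the figure means 0.13 kW of cooling power for each kW of heat removed, and that a rack runs at the full 35 kW mentioned above. Both are assumptions, and the function name is hypothetical.

```python
# Figures quoted in the article.
COOLING_KW_PER_KW = 0.13   # kW of cooling power per kW of heat removed (assumed reading)
MAX_RACK_LOAD_KW = 35.0    # maximum load per rack


def cooling_power_kw(heat_load_kw: float, overhead: float = COOLING_KW_PER_KW) -> float:
    """Estimate the power spent on cooling for a given heat load."""
    return heat_load_kw * overhead


# Cooling power for one rack at full load: 35 kW x 0.13, about 4.55 kW.
rack_cooling_kw = cooling_power_kw(MAX_RACK_LOAD_KW)
```

Under these assumptions, a rack drawing 35 kW would need roughly 4.55 kW of cooling power.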
A good metric for data center efficiency is Power Usage Effectiveness (PUE). You divide the total amount of power entering a data center by the power used to run the computing infrastructure within it to arrive at a ratio. The closer you are to 1, the better. The facility has achieved a PUE of 1.27 even when additional enterprise computing equipment beyond the Red Sky/Red Mesa supercomputer is factored in.
“It is outstanding to achieve a PUE of 1.27 with a 43,000 core machine,” said Zepper.
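The PUE ratio can be sketched in a few lines of Python. The load figures below are hypothetical and chosen only to reproduce the reported 1.27. The facility's actual power draw is not given here.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """PUE: total power entering the facility divided by power used by IT equipment."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


# Hypothetical loads (not Sandia's real numbers): 1,270 kW enters the
# facility and 1,000 kW of that reaches the computing equipment.
example_pue = pue(total_facility_kw=1270.0, it_equipment_kw=1000.0)

# Share of the facility's total power that goes to overhead such as
# cooling and power distribution rather than to computing.
overhead_share = (example_pue - 1.0) / example_pue
```

At a PUE of 1.27, about 21 percent of the power entering the facility goes to overhead rather than to computing.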
He also commented on the APC PDUs, which provide 288 kW in half a rack, a capacity that took four racks in earlier generations.
On the storage side, Sandia has 148 Oracle Sun J4400 disk arrays, which provide 6 PB of storage for the supercomputing cluster. The disks are 1.8 TB Seagate SATA drives housed in Sun Just a Bunch of Disks (JBOD) enclosures. The cluster reaches the Lustre file systems over InfiniBand with a throughput of 20 GB/sec.
“Lustre has I/O controllers that aggregate the data and allow hundreds of users to access our machines,” said Zepper.
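For a sense of scale, the storage figures can be combined in a short back-of-the-envelope Python calculation. It assumes decimal units (1 PB = 1,000 TB = 1,000,000 GB) and that the 20 GB/sec rate could be sustained continuously, which is an idealization.

```python
# Figures quoted in the article.
TOTAL_CAPACITY_PB = 6
DISK_ARRAYS = 148
THROUGHPUT_GB_PER_SEC = 20

# Assumption: decimal units, 1 PB = 1,000 TB = 1,000,000 GB.
TB_PER_PB = 1_000
GB_PER_PB = 1_000_000

# Average capacity contributed by each J4400 array.
per_array_tb = TOTAL_CAPACITY_PB * TB_PER_PB / DISK_ARRAYS

# Time to stream the full 6 PB once at the quoted throughput.
full_scan_seconds = TOTAL_CAPACITY_PB * GB_PER_PB / THROUGHPUT_GB_PER_SEC
full_scan_days = full_scan_seconds / 86_400  # seconds per day
```

On these assumptions, each array contributes about 40.5 TB, and reading the entire 6 PB once at 20 GB/sec would take roughly three and a half days.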
Most importantly, the Red Sky/Red Mesa platform has brought about a major shift in the amount of time needed to simulate complex fuel models. Zepper said it had dropped from between four and six months to four weeks. That’s the whole point for the facility: to allow researchers to accelerate the pace at which they can address lab work.
Drew Robb is a freelance writer specializing in technology and engineering. Currently living in California, he is originally from Scotland, where he received a degree in geology and geography from the University of Strathclyde. He is the author of Server Disk Management in a Windows Environment (CRC Press).