A few years back, a discussion of supercomputing architectures would have turned to talk of Cray or IBM eServer systems. You probably wouldn’t have heard much about running Linux on Intel processors. However, with the maturation of open-source technology and the development of Intel’s 64-bit technology, the supercomputing dialogue is different these days.
NASA’s Columbia supercomputer, which is built from 20 SGI Altix systems each running 512 Intel Itanium 2 processors, is exploring the outer limits of high-performance computing.
For a glimpse into the future of high-performance computing, look at NASA’s 10,240-processor Columbia supercomputer, which achieved a rate of 51.87 trillion calculations per second (teraflops) earlier this month on the Linpack benchmark, earning it the number 2 spot on the Top 500 supercomputing list.
“It used to take us six months to a year to model a decade’s worth of ocean data. Now we can do that in a day or two.”
—Walt Brooks, division chief, advanced supercomputing division, NASA.
Columbia was built from 20 Altix systems by Silicon Graphics Inc. (SGI), each powered by 512 Intel Itanium 2 processors and running Red Hat Linux. Columbia is housed at the NASA Advanced Supercomputing (NAS) facility at Ames Research Center in Mountain View, Calif.
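The headline figures can be cross-checked with a little arithmetic. The Python sketch below multiplies out the processor count and estimates the sustained Linpack rate per processor. It assumes the 51.87-teraflop result was spread evenly across all 10,240 processors, which is an assumption made for illustration and not something stated here. The constant and function names are illustrative only.

```python
# Figures quoted in the article.
ALTIX_SYSTEMS = 20
PROCESSORS_PER_SYSTEM = 512
LINPACK_TERAFLOPS = 51.87


def total_processors(systems: int, per_system: int) -> int:
    """Total processor count across all Altix systems."""
    return systems * per_system


def gigaflops_per_processor(teraflops: float, processors: int) -> float:
    """Average sustained rate per processor, in gigaflops.

    Assumption: the benchmark load is spread evenly over every processor.
    """
    return teraflops * 1000.0 / processors


processors = total_processors(ALTIX_SYSTEMS, PROCESSORS_PER_SYSTEM)
per_cpu = gigaflops_per_processor(LINPACK_TERAFLOPS, processors)

print(f"Processors: {processors}")
print(f"Approximate Linpack rate per processor: {per_cpu:.2f} gigaflops")
```

Under that assumption, each Itanium 2 sustains roughly 5 gigaflops on Linpack.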
From Months to Days
“NASA scientists used to have to wait on supercomputing resources, and the systems they used might take months to complete complex simulations,” said Walt Brooks, division chief, advanced supercomputing division at NASA. “Now, we have enough capacity for everybody and can do difficult computations in days.”
NAS has been at the forefront of supercomputing for decades. In the ’80s, it began using Crays, and currently has a 16-processor Cray X1 on the floor. In the late ’90s, the organization introduced Origin supercomputers by SGI, running IRIX, a UNIX variant. In 2003, NAS had one teraflop of Origins on the data center floor. At that time, though, it began to test SGI’s newest line — the Altix platform. Technicians at NAS installed an Altix 3700 system one year ago. It became, in fact, the first 512-processor Linux/Itanium combo with a single system image (SSI).
“SSI creates a good environment for code development in areas like computational fluid dynamics, weather patterns and ocean currents,” said Brooks. “It offers low latency and is extremely user friendly.”
Impressed with the results, NAS then decided to go all out with Altix as the core of a new high-performance computing architecture. According to Bob Ciotti, terascale systems lead at NAS, the elements of the system are the following:
- two Brocade Silkworm switches, each with 128 ports
- disk drives from Engenio: 200 TB of Fibre Channel and 250 TB of SATA
- memory technology from Dataram and Micron Technology
- a 288-port InfiniBand interconnect from Voltaire
- a 440-terabyte SGI InfiniteStorage SAN
- a total of 20 Altix systems: 12 are Altix 3700s and the latest eight are Altix 3700 Bx2s
The Bx2 doubles the density of the Altix line, making it possible to pack 64 processors into a standard rack. SGI’s newest model also adds cooling doors. Chilled water is brought in and coupled into radiator loops within the door to keep the temperature down.
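To put the density figure in concrete terms, the sketch below estimates how many racks one 512-processor system occupies. The 64-processors-per-rack figure comes from the text. The 32-per-rack figure for the original Altix 3700 is an assumption inferred from "doubles the density", and the count covers compute racks only, ignoring any racks for routers, I/O or storage.

```python
import math

PROCESSORS_PER_SYSTEM = 512
BX2_PROCESSORS_PER_RACK = 64
# Assumption: "doubles the density" means the original 3700 held half as many.
ORIGINAL_PROCESSORS_PER_RACK = BX2_PROCESSORS_PER_RACK // 2


def racks_needed(processors: int, per_rack: int) -> int:
    """Compute racks required, rounding up to whole racks."""
    return math.ceil(processors / per_rack)


bx2_racks = racks_needed(PROCESSORS_PER_SYSTEM, BX2_PROCESSORS_PER_RACK)
original_racks = racks_needed(PROCESSORS_PER_SYSTEM, ORIGINAL_PROCESSORS_PER_RACK)

print(f"Altix 3700 Bx2: {bx2_racks} racks per 512-processor system")
print(f"Altix 3700 (assumed density): {original_racks} racks per 512-processor system")
```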
With the experience gained on the first 512-processor system, technicians from NAS and SGI installed this giant in 120 days. Each 512-processor unit behaves like a single system using the SGI NUMAflex shared-memory architecture, which incorporates a low-latency, high-bandwidth interconnect designed to maintain performance as it scales. The aim is more efficient resource management and better access to local and remote memory, without bottlenecks at switches, backplanes or interconnects.
The entire memory space is shared so large simulation models can fit into memory with no programming restrictions. Rather than waiting for all of the processors to complete their assigned tasks, Columbia reassigns resources to more complex areas.
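The benefit of reassigning resources instead of waiting on the slowest processor can be illustrated with a toy model. The sketch below is not NASA's or SGI's scheduler. It is a hypothetical comparison of two generic strategies: a fixed round-robin split of tasks across workers, and a dynamic scheme where each task goes to whichever worker becomes free first. The task costs and function names are made up for illustration.

```python
import heapq


def static_makespan(task_costs, workers):
    """Finish time when tasks are pre-assigned round-robin to workers."""
    loads = [0.0] * workers
    for index, cost in enumerate(task_costs):
        loads[index % workers] += cost
    return max(loads)


def dynamic_makespan(task_costs, workers):
    """Finish time when each task goes to the earliest-free worker."""
    free_at = [0.0] * workers  # min-heap of times at which workers become idle
    heapq.heapify(free_at)
    for cost in task_costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# Hypothetical workload: two expensive regions among many cheap ones.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
static_time = static_makespan(costs, workers=4)
dynamic_time = dynamic_makespan(costs, workers=4)

print(f"Static split finishes at t={static_time}")
print(f"Dynamic assignment finishes at t={dynamic_time}")
```

In this toy workload the fixed split leaves three workers idle while one handles both expensive tasks, so the dynamic scheme finishes sooner.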
“NUMAflex gives us performance we need to ensure our researchers can focus on their science, not the technology used in their computations,” said Brooks.
Brooks reports relatively few bugs during and after installation. Having achieved familiarity with the platform via the first 512-processor unit, NAS used that unit as its basic building block. Minor issues with memory and one pinhole leak in a cooling door were the only problems over the four months of deployment. Brooks recommends that others get familiar with the system before embarking on an aggressive supercomputing project.
How Fast Is It?
In assessing the value of Columbia, the gain in speed of computations is startling. On NASA’s previous supercomputers, simulations showing five years’ worth of changes in ocean temperatures and sea levels were taking a year to model. Using the Altix-based Columbia, scientists can simulate decades of ocean circulation in a few days. The resulting simulations are also more detailed than before.
“It used to take us six months to a year to model a decade’s worth of ocean data, and now we can do that in a day or two,” said Brooks. “This accelerates the rate of discovery as the best science teams in the world are not waiting around for access to computing resources, or having to wait months to receive the answers they need.”
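The figures in that quote imply a rough speedup range, which the sketch below works out. It treats "six months" as half of a 365-day year and takes the quoted ranges at face value, so the result is a back-of-the-envelope estimate, not a measured benchmark. The names are illustrative.

```python
DAYS_PER_YEAR = 365


def speedup(old_days: float, new_days: float) -> float:
    """How many times faster the new runtime is than the old one."""
    return old_days / new_days


# "Six months to a year" before; "a day or two" now.
old_range_days = (DAYS_PER_YEAR / 2, DAYS_PER_YEAR)
new_range_days = (1, 2)

# Conservative: shortest old runtime against longest new runtime.
conservative = speedup(old_range_days[0], new_range_days[1])
# Optimistic: longest old runtime against shortest new runtime.
optimistic = speedup(old_range_days[1], new_range_days[0])

print(f"Implied speedup: roughly {conservative:.0f}x to {optimistic:.0f}x")
```

Even the conservative end of that range is close to two orders of magnitude.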
Many areas within NASA have benefited from this improved computing power. In the Space Shuttle program, for example, it used to take two to four months for a single reentry analysis, which is no good if you need the data during an emergency. Using Columbia, Brooks said, a dozen runs were just completed in a 24-hour period.
Other gains include weather pattern modeling and aircraft design. Instead of two days’ warning on a hurricane path, he said, the path can now be predicted five days before a storm makes landfall. And aircraft design analysis has gone from years of work to a single day.
While Columbia currently runs Red Hat Linux with the 2.4 kernel, NAS plans to move shortly to SUSE Linux with the 2.6 kernel, as that is the distribution preferred by SGI. The system will also accommodate Intel’s about-to-be-released Itanium 2 processors with 9MB cache.
High Praise for High Performance Computing
“NASA’s Columbia system signals a new approach to supercomputing design, one in which the most powerful computer systems can be deployed in weeks rather than many months or even years,” said Earl Joseph, research vice president, high-performance systems practice for IDC. “Columbia represents a new breed of large scale supercomputer, one that can be replicated at any national laboratory or university.”