Fast NASA Action Begets World's Largest Linux Supercomputer
Mar 31, 2005 5:00 AM PT
The space shuttle Columbia disaster on Feb. 1, 2003, sparked development of the world's second-largest supercomputer, a system with 10,240 Intel Itanium 2 processors capable of performing 51.87 trillion calculations per second.
Not only did the supercomputer, appropriately dubbed Columbia, stretch Linux's performance boundaries, but it also strained NASA's installation capabilities and the manufacturing ability of the hardware supplier, Silicon Graphics Inc. That's because the supercomputer went from design to deployment in 120 days rather than the years typically associated with building such a complicated system.
The rapid installation schedule stemmed from the U.S. government's Return to Flight initiative, which was launched to take the steps necessary to ensure that future shuttle flights would be safe.
"Because another shuttle launch was set for the end of the year, we quickly needed to put a computer system in place that could provide us with information about the impact of any problems with the take off or any of the equipment," stated Dr. Walt Brooks, division chief at the NASA Advanced Supercomputing center.
Need for Speed
As a result, governmental bureaucratic red tape, which typically adds years to the purchasing process, was cast aside and a team was assembled to select a new supercomputer. At the time, NASA was thinking about adding a supercomputer with 10 teraflops to 15 teraflops of computing power to its arsenal.
"We were running into performance problems with some of our more sophisticated applications," explained NASA's Brooks. The agency examined linking thousands of dual-processor commodity servers into a sprawling cluster, but that approach did not mesh with the scientific applications that the agency ran. Instead, the agency opted for a large multiprocessor system.
NASA had worked with SGI to build Kalpana, a 512-processor system named in honor of Kalpana Chawla, a NASA astronaut lost in the Columbia accident. Satisfied with the results from that project, the agency turned to SGI to build the new system. Once the bastion of proprietary supercomputers, the high-performance computing market had been shifting to more commodity components and open-source software, like Linux -- changes that help to lower the cost of these multimillion-dollar systems.
Once the system design was established, SGI and NASA had to build the new supercomputer. Since the installation timeframe was unprecedented, Dick Harkness, vice president of SGI's manufacturing facilities, who oversees about 200 employees, threw out all of the company's traditional project management techniques. "We literally put new business processes in place as the project unfolded," he told LinuxInsider. "It was obvious that the established ones would not be able to scale up and support the enormity and complexity of the tasks involved with delivering the new system."
SGI had perfected the process of building two supercomputer processor nodes simultaneously and had to increase that number so that six were being constructed at one time.
"We had to continue filling orders for our other customers as we put the NASA system together," noted Harkness. To meet its shipment goals, the company expanded its manufacturing facility and brought in scores of outside contractors. "Many employees volunteered to work nights and weekends in order to help us meet our production deadlines," Harkness said.
Is the Pipeline Full?
Keeping things moving meant ensuring that there were adequate components in the pipeline. "We knew what we needed to deliver the system, but what we learned was we needed more insight into our vendors' supply chains, so we could see how likely, or unlikely, it was that they could deliver needed components to us on time," SGI's Harkness explained. As a result, the firm took an interest in items like cabling and LED assembly, which before were of little to no concern.
There were also technical challenges. Squeezing more than 10,000 processors into NASA's supercomputing room in Mountain View, Calif., meant Columbia had to incorporate eight 512-processor nodes configured in a new high-density, high-bandwidth version of the SGI Altix 3000 system. In addition, a 440-terabyte SGI InfiniteStorage system was used to store and manage the terabytes of new data that would be generated every day.
Because the supercomputer was so vast and the circuitry so dense, the hardware supplier had to improve its cooling system. To ensure that the new system would not overheat, SGI designed new water-cooled doors -- the first time such features were offered on systems other than Cray supercomputers.
NASA also had to retool its operation. The agency had to redo the plumbing for its supercomputer water cooling system, so it could handle heat thrown off by the new system. The agency's supercomputing space had to be reconfigured without disturbing existing users, and the agency had to prepare to support a larger number of end users.
Live Beta Testing
Surprisingly, these problems proved more difficult than the design and deployment of the supercomputer itself. SGI completed basic quality assurance tasks, but the devices could not be fully tested until they were installed at NASA. In effect, NASA employees used beta versions of new hardware that were less than a week old, and in some cases, the processors were running only 48 hours after they had left the assembly floor. This beta testing model was quite different from the typical testing process, which takes months, sometimes even years. "We were concerned about software installation and application compatibility but fortunately encountered no major problems," said Jeff Greenwald, senior director of marketing at SGI.
The SGI Altix 3700 supercomputer gave NASA a significant performance boost. The supercomputer relies on industry-standard 64-bit microprocessors running Linux, and each node scales up to support 256 processors with 3 TB of memory. Round-trip data transmission can take as little as 50 nanoseconds, and the supercomputer sustained 42.7 trillion calculations per second on 16 of its 20 systems, an 88 percent efficiency rating based on the LINPACK benchmark.
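For readers who want to check the arithmetic behind those benchmark figures, the short Python sketch below works backward from the numbers quoted in this article. It is a back-of-the-envelope illustration, not an SGI or NASA calculation. It assumes each of the 20 systems holds 512 processors (which matches the 10,240-processor total) and that the 88 percent rating is simply sustained performance divided by theoretical peak. The function and variable names are made up for this example.

```python
# Back-of-the-envelope check of the LINPACK figures quoted in the article.
# Assumption (not stated outright in the article): every system has
# 512 processors, so 20 systems give the quoted 10,240 total.

SUSTAINED_TFLOPS = 42.7       # sustained LINPACK result on 16 of 20 systems
EFFICIENCY = 0.88             # quoted efficiency rating
SYSTEMS_USED = 16
TOTAL_SYSTEMS = 20
PROCESSORS_PER_SYSTEM = 512   # assumed, see note above


def implied_peak_tflops(sustained_tflops, efficiency):
    """Theoretical peak implied by efficiency = sustained / peak."""
    return sustained_tflops / efficiency


def per_processor_gflops(total_tflops, processors):
    """Average rate per processor, converted from teraflops to gigaflops."""
    return total_tflops * 1000.0 / processors


processors_used = SYSTEMS_USED * PROCESSORS_PER_SYSTEM
total_processors = TOTAL_SYSTEMS * PROCESSORS_PER_SYSTEM
peak_tflops = implied_peak_tflops(SUSTAINED_TFLOPS, EFFICIENCY)
peak_per_cpu = per_processor_gflops(peak_tflops, processors_used)
sustained_per_cpu = per_processor_gflops(SUSTAINED_TFLOPS, processors_used)

print(f"Processors in the benchmark run: {processors_used}")
print(f"Processors in the full machine:  {total_processors}")
print(f"Implied peak for the run:        {peak_tflops:.1f} teraflops")
print(f"Implied peak per processor:      {peak_per_cpu:.2f} gigaflops")
print(f"Sustained per processor:         {sustained_per_cpu:.2f} gigaflops")
```

Under those assumptions, the 16-system run used 8,192 processors and implies a theoretical peak of roughly 48.5 teraflops, or a little under 6 gigaflops per processor.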
The added horsepower has made it easier for NASA to complete certain tasks. Hydrogen gas flow chamber simulations in the space shuttle propulsion systems can now be done in days instead of weeks. New applications include earth modeling, space science and aerospace vehicle design, so scientists are now able to more easily map global ocean circulation, predict large-scale structures in the universe, and examine the physics of supernovae.
Use of the system is expected to expand. "As word about the processing power we now have has spread, scientists have found new applications that can take advantage of it," concluded NASA's Brooks. "We are quite busy, not as busy as we were during the installation, but still there is plenty for us to do."