Analysis: Forget Fermi, bring on the Atom.
Nvidia might have an unlikely competitor thanks to the mess it made with Fermi.
Fermi chips on Nvidia's high-performance computing (HPC) Tesla boards have disappointed, not just on performance but, more importantly, through the firm's continued inability to rein in power consumption. The ill-advised design choices it made have, as The INQUIRER reported first, left some of its formerly loyal supporters such as Silicon Graphics International (SGI) looking elsewhere for viable alternatives.
For HPC vendors the Fermi problem isn't just lower-than-expected performance but the reality of data centres, where Nvidia's Fermi-based Tesla cards limit the amount of hardware a customer can put in a rack before running out of cooling capacity. The Fermi architecture was designed to extend Nvidia's rapid expansion in the HPC market, so the news that the Green Goblin had not only to cut back on the number of 'Cuda' cores but to increase the thermal design power (TDP) as well was a double whammy.
As Nvidia got too clever for its own good, it forgot what had made its GPGPUs so popular. SGI's senior director of server product marketing Bill Mannel told The INQUIRER that even though SGI had invested heavily in Field Programmable Gate Array (FPGA) technology through its RASC programme, when it surveyed the alternatives, including the Cell architecture, Nvidia's Cuda represented the best fit for the firm at the time. In some ways SGI's decision was the correct one, as Mannel says that many of the other options have since "fallen away". Now there's a chance that Nvidia will join that list.
To understand why power draw and cooling are so important in a graphics chip that's destined for HPC environments, one must look at the mindset behind running equipment that is designed for remote administration. Those who have spent any time working in a data centre will attest that it is a hostile place to work.
Servers and cooling equipment create an oppressive cacophony, and when that is combined with the low ambient temperature, which Fermi does so well to raise, even the simplest of tasks can become tedious. Sending an engineer can be especially costly given the growth in modularised data centres, which allow equipment to be dumped in the middle of nowhere to make use of lower data centre costs.
For this reason, vendors such as SGI design servers so that the majority of maintenance can be done remotely, which leaves hardware failure as the primary reason to ever send engineers down to the data centre. However, that's expensive. Therefore when Mannel admits reliability is affected by "hot cards" such as Fermi, it's hardly surprising that SGI is thinking of jumping ship.
Thanks to its acquisition of Cray back in 1996, SGI has access to several exotic HPC cooling approaches. According to Mannel, the same Cray engineers who designed some of the supercomputing icons of the 70s and 80s are still on the payroll. Coming from a company that was known for over-engineering its products, those engineers, whom Mannel calls "a conservative bunch" when it comes to pushing the thermal design envelope, are wary of Fermi, and rightly so. The engineers lighten up when it comes to 'cloud computing' clusters, presumably because a limited number of failures can be masked by the sheer quantity of machines.
The aura surrounding Fermi cards is enough to instil fear into systems vendors. SGI not only has to put "an additional amount of work" into testing systems, but even those that do not ship with a Fermi board have to be designed "with the knowledge that a [Fermi] Tesla board may be added". Given that Mannel has already indicated that "hot cards" can lead to a "worse failure profile", how long can Nvidia expect vendors to go the extra mile to design, implement and service boards that give them so much trouble at every stage of their lifecycle?
theinquirer.net (c) 2010 Incisive Media
Issue: 137 | June, 2012