GPU advances
On the other hand, GPU priorities are a little different. Multiplying thread counts and the number of processing units mattered more here than the power of each individual unit, as the typical graphics pipeline is far more predictable and more parallel than most tasks run on general-purpose CPUs. So, whether it is an AMD/ATI HD5870 with 1,600 simple shaders in parallel or an Nvidia GT300 with 512 more complex and more CPU-like shader cores, a GPU looks very different from the 4 to 8 CPU cores on a processor die.
Despite average clock speeds that are roughly four times slower for the core, or three times slower for the shaders, than those of standard CPUs, the vast parallelism of GPUs allows far higher theoretical computational power. As for the double-precision floating-point throughput we discussed before, let's look at what AMD/ATI and Nvidia might have in a few months, in the same timeframe as Intel's Gulftown and AMD's Magny-Cours.
On the ATI side, a speed update for the HD5870, probably called something like HD58X0, should arrive with the refinement and stepping updates of the R800 family dies. Running at a default 950MHz GPU clock with proportionally sped-up shaders, the new device should reach 3 TFLOPS in single-precision floating-point and, more importantly, 600 GFLOPS in double-precision floating-point, both IEEE compliant. In fact, some of the overclockable HD5870 entries, like those from Asus, already provide such speeds.
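The HD58X0 figures above follow from simple peak-throughput arithmetic, sketched here in Python. The sketch assumes each shader retires one multiply-add (two floating-point operations) per clock, and that double precision runs at one fifth of the single-precision rate. That ratio is an assumption chosen because it matches the 3 TFLOPS and 600 GFLOPS quoted above. The function and constant names are made up for illustration.

```python
def peak_gflops(shaders: int, clock_ghz: float, flops_per_clock: float = 2.0) -> float:
    """Theoretical peak in GFLOPS: units x clock (GHz) x FLOPs per unit per clock."""
    return shaders * clock_ghz * flops_per_clock


SHADERS = 1600      # shader count quoted for the HD5870
CLOCK_GHZ = 0.95    # the hypothetical 950MHz clock
DP_RATIO = 1 / 5    # assumption: double precision at 1/5 of the single-precision rate

sp = peak_gflops(SHADERS, CLOCK_GHZ)   # single-precision peak, GFLOPS
dp = sp * DP_RATIO                     # double-precision peak, GFLOPS

print(f"single precision: {sp / 1000:.2f} TFLOPS")
print(f"double precision: {dp:.0f} GFLOPS")
```

This works out to roughly 3.04 TFLOPS and 608 GFLOPS, which the text rounds to 3 TFLOPS and 600 GFLOPS. These are theoretical peaks, so real code will land below them.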
So, if your code can run efficiently with AMD Stream libraries and such, a hypothetical dual-GPU HD58X0 card will likely give you 1.2 TFLOPS of double-precision floating-point power for precision runs, and 6 TFLOPS of single-precision floating-point for parametrisation and estimation runs. Just make sure there is enough memory on the card to hold the data sets of multiple threads without going over the PCIe bus to main memory, because, even with the GPU's limited caching, the slow link can cut performance by as much as an order of magnitude. Therefore, 2GB of GDDR5 memory per GPU is strongly recommended if you are doing GPU computation.
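To see why spilling over PCIe hurts so much, here is a rough Python sketch that compares streaming a data set from local graphics memory with pulling it across the bus. Both bandwidth figures are assumptions that do not come from the figures above: about 153.6GB/s for the HD5870's GDDR5, and about 8GB/s per direction for a PCIe 2.0 x16 slot. It models a purely bandwidth-bound job and ignores latency, caching and any overlap of transfers with computation.

```python
def stream_time_s(data_gb: float, bandwidth_gb_s: float) -> float:
    """Seconds needed to read data_gb gigabytes once at the given bandwidth."""
    return data_gb / bandwidth_gb_s


LOCAL_GB_S = 153.6   # assumption: HD5870 GDDR5 peak bandwidth
PCIE_GB_S = 8.0      # assumption: PCIe 2.0 x16 peak, one direction
DATA_GB = 2.0        # a data set that just fills the recommended 2GB

local = stream_time_s(DATA_GB, LOCAL_GB_S)    # data already in graphics memory
remote = stream_time_s(DATA_GB, PCIE_GB_S)    # data fetched from main memory
slowdown = remote / local

print(f"local pass:  {local * 1000:.1f} ms")
print(f"PCIe pass:   {remote * 1000:.1f} ms")
print(f"slowdown:    {slowdown:.1f}x")
```

Under these assumptions the bus is about 19 times slower than local memory, in line with the order-of-magnitude penalty mentioned above. Code that reuses data heavily once it is on the card would see a smaller hit.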
By early next year, we all hope that Nvidia's GT300 will already be launched and shipping, because if it isn't, that will be big trouble for the green graphics gang. Let's assume it is. With 512 shader processors that can do either 512 single-precision or 256 double-precision fused multiply-adds per clock, that would give you, at a shader clock of, say, 1.8GHz, about 1.8 TFLOPS in single-precision mode or 900 GFLOPS in double-precision mode. Not bad at all.
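The GT300 numbers can be checked the same way, counting each fused multiply-add as two floating-point operations. The Python sketch below also compares the double-to-single precision ratio with the one implied by the ATI figures quoted earlier (600 GFLOPS out of 3 TFLOPS). The 1.8GHz clock is the guess from the text, not a confirmed specification, and the names are illustrative.

```python
def peak_gflops_fma(fma_per_clock: int, clock_ghz: float) -> float:
    """Peak GFLOPS when each fused multiply-add counts as two FLOPs."""
    return fma_per_clock * 2 * clock_ghz


SHADER_CLOCK_GHZ = 1.8   # the guessed shader clock, not a confirmed spec

gt300_sp = peak_gflops_fma(512, SHADER_CLOCK_GHZ)   # 512 SP FMAs per clock
gt300_dp = peak_gflops_fma(256, SHADER_CLOCK_GHZ)   # 256 DP FMAs per clock
gt300_ratio = gt300_dp / gt300_sp

# ATI figures as quoted earlier in the text, in GFLOPS
ATI_SP_GFLOPS = 3000.0
ATI_DP_GFLOPS = 600.0
ati_ratio = ATI_DP_GFLOPS / ATI_SP_GFLOPS

print(f"GT300 single precision: {gt300_sp:.1f} GFLOPS")
print(f"GT300 double precision: {gt300_dp:.1f} GFLOPS")
print(f"DP/SP ratio: GT300 {gt300_ratio:.2f}, HD58X0 {ati_ratio:.2f}")
```

The exact results are about 1,843 and 922 GFLOPS, which the text rounds down. The point is the ratio: the GT300 would keep half of its single-precision rate in double precision, where the ATI part keeps a fifth.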
But what's far more interesting is that the GT300 promises to enable a far greater range of codes to make use of all that power. With an overall architecture far closer to a CPU this time, many normal C, C++ and Fortran codes should be able to run on it out of local GPU memory. With up to 6GB of onboard memory in the first iteration, and 8GB in the subsequent one, the latter with a 512-bit memory bus, the GT300 should be quite a bundle.
What the GT300 lacks to be a true CPU and run all the usual stuff, including booting an OS, is a full-fledged memory management unit (MMU) for virtual-to-physical memory translation, and a front-end general-purpose CPU instruction set. That's why I have said many times that Nvidia should have had a real CPU, like, say, the Alpha, which would both provide ultrahigh performance better than the X86 to fill that niche and offer the built-in capability to run X86 code very fast via a real-time translator like the famed FX!32, without having to pay for an X86 license.
Don't forget that the last planned Alpha incarnation, the EV9 21564, was supposed to have a kilobyte-wide (yes, 8,192 bits) vector unit able to put out over 100 GFLOPS in double-precision floating-point, some 9 years ago. Imagine what it would be able to achieve today.
The Tegra and other ARM-based stuff is simply too weak to be a front end for a gigantic TFLOPS-class GPU. For a proper "fusion" at the system level, you need very fast and wide main system memory, a multi-channel, multi-gigabyte setup at least, to feed it from the CPU side, plus multiple very fast HyperTransport, QuickPath or Alpha EV links to connect multiple GPUs with the main CPUs for efficient coherent shared memory access between GPU and CPU memory banks. In the absence of a general-purpose CPU of its own that can do this, Nvidia might have to negotiate a QPI license with Intel to link its GPUs directly to Westmere and future CPUs, in order to enable more of the coprocessor model here. But wait, wouldn't the long-delayed Larrabee be gunning for the same role?
I'll have more on this, and the 'ideal' CPU-GPU system configuration, in Part 2.
Issue: 133 | February, 2012