Part 1: Fusion or fission?
Over the past two years, CPU and GPU capabilities have started to converge, but at a snail's pace. GPUs can now handle branching code and double-precision IEEE floating-point operations, but we still don't have a GPU that can run generic C or Fortran code, at least not until the Nvidia GT300 comes out. Nor do we have one that can access system memory and support the paged virtual memory model needed to boot a modern OS, and not even the GT300 will be able to do that.
On the other hand, attempts by CPUs to handle real-time 3D graphics at any decent speed have, up to now, failed miserably. So CPUs have not replaced GPUs, nor have GPUs come much closer to replacing CPUs. How far along is each camp now? And how will they perform within our crystal ball's prediction horizon of, say, six months from now? Let's look at the favourite common metric between the two camps: the peak GFLOPS floating-point rate, in double precision of course.
CPU speedups
Intel's 32nm Westmere 6-core chip is the next major step in Chipzilla's roadmap. The flagbearer Westmere, in its Gulftown-EP dual CPU configuration and, a month or two later, single CPU desktop configuration, will provide 50 per cent more cores matched by 50 per cent more L3 cache at 12 MB, an improved memory controller able to support DDR3-1600MHz even as server memory by default, all within roughly the same clock speed range and die size as the current Nehalem chips.
Running at its standard non-Turbo clock of 3.33GHz, a single Gulftown CPU will give you 80 GFLOPS of raw double-precision floating-point power, or twice that, 160 GFLOPS, in a dual-CPU workstation configuration. I do expect 3.6GHz parts to appear in the Gulftown stable too, before mid-2010.
While this sounds far below top numbers for the current GPUs, keep in mind this is fully general purpose floating-point for any application out there, today or tomorrow. No fancy programming tricks or new code needed. And, as usual, don't be surprised to see most of these Gulftowns doing well at north of 4GHz with even simple overclocking. How about 200 GFLOPS in a dual processor workstation by your deskside? With a grand total of six channels of DDR3-1600MHz server ECC memory, or DDR3-2000MHz desktop memory, these CPUs shouldn't be waiting for main memory data for too long.
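To put the memory side in rough numbers, here is a minimal sketch of the peak bandwidth those six channels would offer. It assumes each DDR3 channel is 64 bits (8 bytes) wide, so peak bandwidth is simply transfers per second times 8 bytes per channel. The function name `ddr3_peak_gb_s` is made up for illustration, and these are theoretical peaks, not sustained rates.

```python
def ddr3_peak_gb_s(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s.

    bytes_per_transfer=8 is an assumption: a standard 64-bit DDR3 channel.
    """
    return mega_transfers_per_s * bytes_per_transfer * channels / 1000.0


# Six channels in total across a dual-CPU workstation, as described above.
server_bw = ddr3_peak_gb_s(1600, channels=6)   # DDR3-1600 ECC server memory
desktop_bw = ddr3_peak_gb_s(2000, channels=6)  # DDR3-2000 desktop memory

# Bytes of peak memory bandwidth available per peak floating-point operation,
# taking the 200 GFLOPS overclocked dual-CPU figure as the compute side.
bytes_per_flop = server_bw / 200.0

print(f"DDR3-1600 x6: {server_bw:.1f} GB/s")
print(f"DDR3-2000 x6: {desktop_bw:.1f} GB/s")
print(f"Bytes per peak FLOP at 200 GFLOPS: {bytes_per_flop:.3f}")
```

The bytes-per-FLOP ratio is only a crude balance indicator; how long a CPU actually waits on memory depends heavily on how well the code reuses the caches.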
AMD's Magny Cours, as two Istanbul dies in a single chip package, will at the same time pack twelve slower cores, probably not clocked higher than 2.4GHz to start. These two dies together will pack the same amount of total L3 cache as a single Intel Gulftown, but in theory will still be able to churn out over 110 GFLOPS of total peak double-precision floating-point power, or around 55 GFLOPS per die. Hopefully in the same time frame AMD will be able to speed up the single die Istanbul to above 3GHz, especially for the eventual desktop version.
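The peak figures quoted so far all come from the same back-of-the-envelope arithmetic: cores times clock speed times floating-point operations per core per cycle. The sketch below reproduces them, assuming 4 double-precision FLOPs per core per cycle (one 2-wide SSE add plus one 2-wide SSE multiply) for both the Intel and AMD cores. That per-cycle figure is not stated in this article, and the function name `peak_gflops` is purely illustrative.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle=4, sockets=1):
    """Theoretical peak double-precision GFLOPS.

    flops_per_cycle=4 is an assumption: one 128-bit SSE add and one
    128-bit SSE multiply issued per cycle, two doubles each.
    """
    return cores * clock_ghz * flops_per_cycle * sockets


# Intel Gulftown: 6 cores at 3.33GHz, single and dual CPU.
gulftown_single = peak_gflops(6, 3.33)
gulftown_dual = peak_gflops(6, 3.33, sockets=2)

# The same dual-CPU box overclocked to 4GHz.
gulftown_dual_oc = peak_gflops(6, 4.0, sockets=2)

# AMD Magny Cours: two 6-core dies in one package at 2.4GHz.
magny_cours_die = peak_gflops(6, 2.4)
magny_cours_chip = 2 * magny_cours_die

print(f"Gulftown, single CPU @ 3.33GHz: {gulftown_single:.1f} GFLOPS")
print(f"Gulftown, dual CPU @ 3.33GHz:   {gulftown_dual:.1f} GFLOPS")
print(f"Gulftown, dual CPU @ 4.0GHz:    {gulftown_dual_oc:.1f} GFLOPS")
print(f"Magny Cours, per die @ 2.4GHz:  {magny_cours_die:.1f} GFLOPS")
print(f"Magny Cours, whole chip:        {magny_cours_chip:.1f} GFLOPS")
```

These are theoretical ceilings that assume every core issues a full-width add and multiply on every cycle; real applications will land well below them.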
The real floating-point throughput advances for both Intel and AMD CPUs should come with their end-2010 next-generation cores: Sandy Bridge for Intel (yes, the one with an integrated on-die GPU in some flavours) and Bulldozer for AMD. Both should offer double the peak double-precision floating-point throughput per clock, enabling roughly 400 GFLOPS of peak number-munching power in a 4GHz, 12-core dual-chip workstation setup, for example, or about twice what a dual Gulftown box at the same clock would manage.
Issue: 107 | December, 2009