Analysis: Not all Cores are the same, it appears...
Our recent story on possible superdesktops using the upcoming Nehalem-EX and Nehalem-EP processors also hinted at their expected overclocking abilities, but what kinds of speeds might we realistically be able to achieve with these monsters?
After all, the 32nm Gulftown Nehalem-EP dualie should start at 3.6GHz for its six cores per die, and in most cases will most probably run well above 4.5GHz when properly cooled on a good mainboard. As for the biggie eight-core Nehalem-EX, even its 2.66GHz top default clock should be pushable to some 3.2GHz under the right conditions. So, these multiprocessing brethren of the otherwise similar Core i7 should share its overclocking margin too.
Sounds easy, right? However, what would we really be overclocking there? Not everything, it seems. Even on the current Core i7, as you know, the default clock you see only applies to the four cores and their L1 and L2 caches. The shared 8MB L3 cache, the memory controller and the QPI interface, all collectively known as the "uncore", have their own clock, and an asynchronous one at that, which means extra latency whenever a request crosses between the two clock domains. This arrangement gives you, among other things, better overclocking for the "core" portion, but the latency and, to a certain extent, the access bandwidth to the uncore are sacrificed a bit. And, before you start bias-ranting, AMD was equally guilty of this with its Barcelona and Shanghai processors and its desktop Phenom CPUs too.
So, your glorious overclocking achievement may show 4.00GHz on screen, but the L3 cache and memory controller inside the CPU might only be working at 2.66GHz if it's using DDR3-1333 DRAM, since the uncore runs at double or more the memory data rate. Now, this may be necessary in the case of the humongous Nehalem-EX die, where the 24MB L3 cache, four memory channels and four QPI links obviously can't run at a very high frequency, but either way your bandwidth and latency benchmarks will be affected by both the "core" and the "uncore" clock rates.
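To put numbers on that mismatch, here is a minimal sketch of the arithmetic. It assumes the uncore runs at exactly twice the DDR3 data rate, which is the low end of the "double or more" mentioned above; the function names and the `multiplier` parameter are purely illustrative and do not correspond to any real tool or BIOS setting.

```python
def uncore_clock_mhz(memory_data_rate, multiplier=2.0):
    """Estimate the uncore clock in MHz from the DRAM data rate in MT/s.

    Assumption: uncore = multiplier x memory data rate, with 2.0 as the minimum.
    """
    return memory_data_rate * multiplier


def uncore_to_core_ratio(core_mhz, memory_data_rate, multiplier=2.0):
    """How fast the uncore runs relative to the overclocked cores."""
    return uncore_clock_mhz(memory_data_rate, multiplier) / core_mhz


# DDR3-1333 is nominally 1333.33 MT/s
ddr3_1333 = 4000 / 3

print(f"uncore: {uncore_clock_mhz(ddr3_1333):.0f} MHz")
print(f"uncore/core at 4.00GHz: {uncore_to_core_ratio(4000, ddr3_1333):.2f}")
```

With these assumptions, a 4.00GHz overclock leaves the uncore at roughly two thirds of the core clock, so the gap widens the harder you push the cores unless the uncore is raised as well.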
On a desktop Core i7 running at 3.33GHz, the Sandra 2009 latency test may show 4-cycle latency for the L1 cache, compared to 3 cycles for the same-sized Core 2 cache, while the small 256KB L2 cache will show 10 cycles, and the big shared 8MB L3 cache block anywhere from 37 to 46 cycles depending on the, yes, "uncore" clock - as you can see on this SiSoft Sandra 2009 shot. By comparison, the large 12MB L2 cache of the Core 2 - two times 6MB, on two dies of course - shows just 16 or 18 cycles of latency as long as the access stays within one dual-core die of the two-die MCM.
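Cycle counts only mean something next to a clock, so it can help to convert them into time. The sketch below assumes the Sandra figures are counted in core clock cycles at the 3.33GHz quoted above; that is an assumption, not something the benchmark output confirms here. The Core 2 numbers are left out because no clock speed is given for that chip.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in clock cycles to nanoseconds.

    One cycle at f GHz lasts 1/f ns, so latency_ns = cycles / f.
    """
    return cycles / clock_ghz


CORE_GHZ = 3.33  # Core i7 clock quoted in the article

# Cycle counts quoted in the article for the Core i7
core_i7_latency_cycles = {
    "L1": 4,
    "L2": 10,
    "L3 (best case)": 37,
    "L3 (worst case)": 46,
}

for level, cycles in core_i7_latency_cycles.items():
    ns = cycles_to_ns(cycles, CORE_GHZ)
    print(f"{level}: {cycles} cycles = {ns:.1f} ns")
```

The same conversion shows why a higher core clock alone does not shrink L3 latency in real time: if the uncore stays put, the cycle count measured against the faster core clock simply goes up.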
As reported elsewhere on the web, due to process and design improvements the mainstream version of the upcoming Sandy Bridge 32nm CPU should have somewhat improved latencies for the same cache structure as the current Core i7. The 32KB L1 cache will be back to 3 cycles, the 256KB L2 cache down to 9 cycles, and the 8MB L3 cache at 25 cycles - not bad for a cache shared between four CPU cores at the same time! Note that this is a Core i5 follow-on; the higher-end CPUs will have more cores, larger caches and possibly slightly higher latencies.
In summary, there's more to it than the clock numbers alone. Even within the same product family, subsequent steppings may have different design compromises to achieve the desired goals, some of them not widely known. And, as the CPUs become more complex, not just with differently-clocked async parts but also in various generations of "turbo" auto-overclock settings, one clock frequency number won't be sufficient to describe the speed anyway. How about, say, Core i5 XXX, core 3333 MHz, uncore 2667 MHz, turbo 3600 MHz, for instance?
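As a rough illustration of what such a multi-clock label could look like in practice, here is a small sketch. The `ClockSpec` type and its field names are hypothetical and invented for this example; the values are the ones from the sample label above.

```python
from typing import NamedTuple


class ClockSpec(NamedTuple):
    """Hypothetical record for a CPU described by several clocks, not just one."""
    model: str
    core_mhz: int
    uncore_mhz: int
    turbo_mhz: int

    def label(self):
        """Render the spec in the style suggested in the article."""
        return (f"{self.model}, core {self.core_mhz} MHz, "
                f"uncore {self.uncore_mhz} MHz, turbo {self.turbo_mhz} MHz")

    def uncore_ratio(self):
        """Uncore clock as a fraction of the default core clock."""
        return self.uncore_mhz / self.core_mhz

    def turbo_headroom_pct(self):
        """Turbo gain over the default core clock, in percent."""
        return (self.turbo_mhz - self.core_mhz) / self.core_mhz * 100


spec = ClockSpec("Core i5 XXX", core_mhz=3333, uncore_mhz=2667, turbo_mhz=3600)
print(spec.label())
print(f"uncore/core: {spec.uncore_ratio():.2f}, turbo: +{spec.turbo_headroom_pct():.0f}%")
```

Keeping the three clocks side by side makes the derived figures, such as the uncore-to-core ratio, easy to compare across parts that share the same headline frequency.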
theinquirer.net (c) 2009 Incisive Media
Issue: 107 | December, 2009