The latest Nehalem-grade Xeon is a performance and efficiency breakthrough.
In about three weeks, we will be coming to the sixth anniversary of the Opteron launch, quite a long time for Intel to be without several key technologies pioneered by that chip. With the launch of the Nehalem EP, Intel finally closed that gap, adopted all those technologies, and has taken the lead in just about every category.
The new cores, great as they may be, are somewhat overshadowed by the uncore parts of the new CPUs. When coupled, the two make a package that wins almost every benchmark out there, very often by large margins. Let's take a look at what this new platform brings to the table.
The first thing people notice is that it's a native quad-core CPU, not a dual-die MCM like many of its predecessors. This may seem a trivial point, but it has massive implications for how memory is seen and how caches are shared, a common bottleneck in modern server CPUs. Each Nehalem core has 64K of L1 cache, split equally between Data and Instruction, and 256K of L2. On top of that, there is 8M of inclusive L3 shared between the cores.
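For a rough sense of how those cache figures add up, here is a back-of-the-envelope sketch in Python. The sizes come from the paragraph above; the function name `cache_budget` and the worst-case assumption that all four cores hold completely distinct lines in their private caches are ours, for illustration only.

```python
# Cache sizes quoted in the article, in KiB.
CORES = 4
L1_KIB_PER_CORE = 64       # 32 KiB data + 32 KiB instruction
L2_KIB_PER_CORE = 256
L3_KIB_SHARED = 8 * 1024   # 8M inclusive L3


def cache_budget(cores=CORES):
    """Return (private KiB per core, private KiB total, L3 KiB not mirroring private caches).

    Assumption (ours): worst case, every line in every private cache is distinct,
    so an inclusive L3 must keep a copy of each of them.
    """
    private_per_core = L1_KIB_PER_CORE + L2_KIB_PER_CORE
    private_total = private_per_core * cores
    l3_not_mirroring = L3_KIB_SHARED - private_total
    return private_per_core, private_total, l3_not_mirroring


if __name__ == "__main__":
    per_core, total, l3_left = cache_budget()
    print(f"private per core: {per_core} KiB")
    print(f"private total:    {total} KiB")
    print(f"L3 left over:     {l3_left} KiB")
```

Because the L3 is inclusive, the private caches do not add to the unique data held on die. Under that worst-case assumption, a slice of the 8M is spent mirroring what the cores already hold, in exchange for simpler coherency checks.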
Nehalem sports SSE4.2 instructions, has an integrated three-channel DDR3 controller, and two QPI links. If that isn't enough for you, simultaneous multi-threading (SMT/HT) makes a comeback and virtualisation is heavily updated. All of this is done with 731 million transistors packed into 263mm².
The biggest bang for buck - or in this case million transistors - is undoubtedly QPI, the technology formerly known as CSI. QPI is a point-to-point link like AMD's HyperTransport, but much newer and faster. The idea is that, instead of connecting each CPU to a single chipset, the CPUs connect to each other and share data directly. No more bottlenecks, no more shared FSB, much lower latency and, in general, better everything.
If you recall, the Core i7 introduced in 1S consumer markets late last year had many of the same features, but not the inter-CPU QPI. The chipset, called Tylersburg, is connected through a similar, but different, interconnect. It all looks like this, with nothing much having changed in the two years since we first printed the diagrams. QPI runs at up to 6.4G transfers per second (25.6GB/s per link, counting both directions), so you don't have to wait long for packets.
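The 25.6GB/s figure follows from the transfer rate with a little arithmetic, sketched below. The 6.4G transfers per second is from the text; the two bytes of payload per transfer in each direction is our assumption based on how QPI is generally described, and the function name is made up for illustration.

```python
def qpi_bandwidth_gb_per_s(giga_transfers_per_s=6.4,
                           payload_bytes_per_transfer=2,
                           directions=2):
    """Return (per-direction GB/s, combined GB/s) for one QPI link.

    Assumption (ours, not from the article): each transfer carries
    2 bytes of payload in each direction.
    """
    per_direction = giga_transfers_per_s * payload_bytes_per_transfer
    combined = per_direction * directions
    return per_direction, combined


if __name__ == "__main__":
    one_way, both_ways = qpi_bandwidth_gb_per_s()
    print(f"per direction: {one_way:.1f} GB/s")
    print(f"combined:      {both_ways:.1f} GB/s")
```

In other words, the headline number counts traffic flowing both ways at once; a one-way stream gets half of it.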
Almost as important is the integrated memory controller. Nehalem brings that to the table in style, with up to three channels of DDR3-1333 supported. Doing one socket this way isn't a trick; two is hard, and four gets downright messy.
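To put three channels of DDR3-1333 into perspective, here is a quick sketch of the theoretical peak bandwidth per socket. The channel count and memory speed are from the text; the 64-bit (8-byte) channel width is the standard DDR3 figure rather than something the article states, and the function name is hypothetical. Real-world throughput will land below this peak.

```python
def ddr3_peak_gb_per_s(mega_transfers_per_s=1333,
                       channels=3,
                       bytes_per_transfer=8):
    """Return (per-channel GB/s, per-socket GB/s) theoretical peak.

    Assumption (ours): each DDR3 channel is 64 bits (8 bytes) wide.
    Uses decimal units: 1 GB/s = 1000 MB/s.
    """
    per_channel = mega_transfers_per_s * bytes_per_transfer / 1000
    per_socket = per_channel * channels
    return per_channel, per_socket


if __name__ == "__main__":
    channel, socket = ddr3_peak_gb_per_s()
    print(f"per channel: {channel:.1f} GB/s")
    print(f"per socket:  {socket:.1f} GB/s")
```

That works out to roughly 32GB/s of peak bandwidth per socket, and since each socket has its own controller, the total scales with the number of CPUs rather than being shared.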
Intel has pulled it off, and latency is pretty darn good, with worst-case numbers showing that remote memory access with DDR3-1067 is still a hair faster than Harpertown with FBD-1600. Local access takes only 60 per cent of that time. Remember, Harpertown had a shared memory controller, and there were far fewer cache coherency problems to deal with. To distribute the load and make it faster at the same time is a huge feat.
One of the bigger gains comes from the 'new way' of doing SMT, and this part is the same as the older i7. If you recall, the much-loved-by-none HT in the P4 generation had the CPU switch gears by pausing one thread and picking up another. There was little efficiency to be gained unless the threads were paused for a long time. That old way is on the left.
theinquirer.net (c) 2010 Incisive Media
Issue: 133 | February, 2012