Core i7 CPU

Justin Robinson | Nov 3, 2008 3:22 PM
Intel | http://www.intel.com
Despite the cost, we still adore this chip for its incredible performance. In time, every enthusiast will need one.
Performance: 100
Overclockability: 79
Value: 62
Overall Rating: 90
We take a look at the Intel Nehalem CPU - the i965.
Intel has dominated the processor industry for the past few years, spurred on by its amazing 65nm Core 2 Duo and Quad releases. It extended that lead when it shrank these processors down to a 45nm process, squeezing the same cores into a smaller space and bumping up both cache and performance significantly. Two years later, we enter a new phase, complete with a new core, and a whole new Intel.

Architectural Advances
Nehalem is built on the tried-and-true 45nm process, but is set into a completely new foundation – the LGA1366 socket. This socket (funnily enough) has 1366 pins inside it, and is physically larger than the current LGA775. It uses a lever securing method similar to LGA775’s, but adds a metal bracket on the back of the motherboard, so physical stability and strength are much improved. The socket wasn’t made bigger and sturdier on a whim, though – it was to hold a whole new chip.

The new architecture is a slight reworking of the previous Core 2 setup, starting with the Instruction Fetch and Pre-Decode stage. Here, the processor retrieves the code and stores it in the 32KB Level-1 instruction cache, an incredibly fast cache located directly next to each core. This cache is fed by a 256KB Level-2 cache (each core has its own), which in turn is fed by a shared 8MB pool of Level-3 cache. Each core can claim more or less of that pool depending on its needs, potentially giving a single core access to the full 8MB.

The instructions are then passed through the Instruction Queue and on to the Decode stage. Along the way, a Branch Prediction unit attempts to ‘guess’ which path the code will take next, allowing whatever is most likely to be needed to be loaded into the cache ahead of time, saving time and increasing performance. Once decoded, the instructions are analysed by the Loop Stream Detector, which looks for repeating sequences in code and can hold them for as many repetitions as needed – essentially skipping all the steps up until this stage.

The final step is to send the instructions to the Execution Units, which perform the calculations. The results are temporarily held in the 32KB Data Cache until they are sent away for storage in the L2 or L3 cache, or in system memory.

Memory Interface
Traditionally, Intel systems have consisted of the CPU, Northbridge, and memory, with the memory controller (the part that coordinates and facilitates memory communication) located in the Northbridge. This was a slow method, causing large amounts of latency (the time it takes to get information from the memory to the CPU). Intel has now created an on-die memory controller, moving this part of the system out of the Northbridge and onto the CPU itself. The controller works with DDR3 memory only, and because memory requests no longer pass through another chip, latency drops by half while bandwidth increases.

Not only that, but the memory controller has been upgraded, providing three channels to system memory for a theoretical bandwidth of up to 32GB/s. This is a massive increase over the previous Northbridge solution, and takes a leaf from AMD’s book (which moved its memory controller onto the die some years ago).
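
As a rough check on that 32GB/s figure, the sum below multiplies channels, transfer rate and channel width. The 64-bit channel width and the DDR3-1333 transfer rate are our assumptions (the paragraph above quotes only the total), and the function name is purely illustrative.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, channel_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = channel_width_bits / 8  # assumed 64-bit DDR3 channel
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Assumption: three channels of DDR3-1333 (1333.33 million transfers per second).
triple_channel_1333 = peak_bandwidth_gb_s(3, 1333.33)
# For comparison, a single channel at the same speed.
single_channel_1333 = peak_bandwidth_gb_s(1, 1333.33)

print(f"Triple channel: {triple_channel_1333:.1f} GB/s")
print(f"Single channel: {single_channel_1333:.1f} GB/s")
```

These are ceilings on paper only; real-world memory benchmarks will land well below them.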

Comprehensive SSE Support
The updated architecture and platform also bring support for all the latest instructions. SSE instructions are extensions to the instruction set that let the CPU work on specific kinds of code more effectively. Nehalem supports every SSE version up to and including 4.2, offering enhanced support for video codecs and encoding, with specific acceleration for voice recognition and DNA sequencing. It’s obvious that Intel would love for its chips to be powering future technologies, and with forward-thinking support like this, it’s in the best position to take advantage.

Turbo mode, Engage!
The TDP (Thermal Design Power) of Nehalem is 130 Watts, which means the chip is designed to put out up to that much heat under heavy load. The problem with most applications today is that they don’t stress the CPU across all four cores, leaving room for extra speed. Turbo mode provides that extra speed, actively increasing the multiplier of the CPU on-the-fly to provide a speed bump of a few hundred MHz while still remaining inside the TDP.

Of course, increasing all four cores is great when you’ve got a multithreaded application, but what if you’ve got a single threaded app? Well, those crazy boffins at Intel thought of that too. In the event that a single core is furiously working through a particularly demanding application, and the other cores are idly sitting there wasting power, Turbo mode can effectively power these idle cores down and increase the speed of the single core – and stay within the TDP – increasing performance in these applications when needed. This alone is probably the most important trend that we’ll see developing in the CPU design world, where the hardware itself will attempt to make allowances for inefficient or single-threaded programming.
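
To make the idea concrete, here is a toy model of that decision, not Intel's actual algorithm. The 130W TDP, 24x multiplier and 133MHz bus come from this review; the power model, its constants and the two-step turbo limit are invented purely for illustration.

```python
BCLK_MHZ = 133          # bus frequency quoted in this review
BASE_MULTIPLIER = 24    # stock multiplier of the i965
TDP_WATTS = 130         # TDP quoted in this review

# Hypothetical power model: a fixed overhead plus a cost per busy core
# that grows with the multiplier. These constants are made up.
OVERHEAD_WATTS = 10.0
WATTS_PER_CORE_PER_STEP = 1.25


def estimated_power(busy_cores, multiplier):
    """Rough power estimate in watts for the given load and multiplier."""
    return OVERHEAD_WATTS + busy_cores * multiplier * WATTS_PER_CORE_PER_STEP


def turbo_multiplier(busy_cores, max_extra_steps=2):
    """Raise the multiplier one step at a time while staying inside the TDP."""
    multiplier = BASE_MULTIPLIER
    for _ in range(max_extra_steps):  # max_extra_steps is an assumed limit
        if estimated_power(busy_cores, multiplier + 1) > TDP_WATTS:
            break
        multiplier += 1
    return multiplier


for busy in (4, 2, 1):
    m = turbo_multiplier(busy)
    print(f"{busy} busy core(s): {m}x -> {m * BCLK_MHZ} MHz")
```

In this sketch, four fully loaded cores already sit at the power limit and get no boost, while idle cores free up headroom for the busy ones to climb.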

Hyper-Space Threading
Hyper-Threading was one of the major points of contention back in the (now ancient) Pentium 4 era. It lets each physical core present itself to the operating system as two logical processors, so two threads can be pumped concurrently to a single CPU core and share its execution resources. The reasoning behind this is that when one thread stalls, for example while waiting on memory, the other can keep the core busy, so the core is rarely left idle between instructions and works through the code more efficiently. This is why you’ll notice that Windows detects Nehalem as an eight-thread quad-core CPU. Applications that take advantage of this tech the most are video encoding and editing, as well as intense scientific calculations.
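
The thread count Windows reports is simple arithmetic, sketched below. The helper name is ours; `os.cpu_count()` is the standard Python call that returns the number of logical processors on whatever machine runs the script (it can return `None` if the count cannot be determined).

```python
import os


def logical_processors(physical_cores, threads_per_core=2):
    """Logical processors the OS sees when each core runs several threads."""
    return physical_cores * threads_per_core


# A quad-core Nehalem with two threads per core.
nehalem_logical = logical_processors(physical_cores=4)
print(f"Quad-core with Hyper-Threading: {nehalem_logical} logical processors")

# What the machine running this script reports (logical, not physical, cores).
print(f"This machine reports: {os.cpu_count()}")
```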

QPI – A whole new FSBallgame
You’re probably familiar (if not intimately so) with the FSB. The FSB, or Front Side Bus, is the link between the CPU and the Northbridge, and through it the rest of the components on the motherboard. It has been used in all Intel motherboards up until this point, and has proven to be a limiting factor in increasing performance, bottlenecking the amount of data that can be worked upon. It is also not very extensible, as each CPU or component added to the system takes a slice of the total bandwidth available, reducing the benefit of adding a second one. The FSB also requires a very high frequency (333MHz is very common), which can place stress on some motherboard components. It’s essentially an outdated tech, which has spurred on the evolution of the FSB’s replacement – the QPI.

The QPI, or Quick Path Interconnect, is similar to AMD’s HyperTransport, or HT bus. The HT bus is a bi-directional parallel link, and the latest standard (3.0) provides a theoretical bandwidth of just over 40GB/s. QPI, along much the same lines, provides multiple links between the cores of the CPUs, memory, and all the components on the board. Since these links are not shared, each one is free to operate at its full bandwidth, allowing a core on another CPU to access memory data at a very fast speed. Current theoretical QPI bandwidth on an X58 motherboard is just over 25GB/s. This might not sound like it’s as good as AMD’s, but here’s the kicker – this is between every major component on the board, in any direction, at any time. In terms of real-world usability and viability, the QPI is a significant improvement, and Intel will surely encroach heavily on the Server market once these chips become available.
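
For the curious, the "just over 25GB/s" figure can be reproduced with a quick sum. The inputs are our assumptions rather than numbers from this review: a 6.4GT/s link rate and 2 bytes of payload per transfer in each direction.

```python
def qpi_bandwidth_gb_s(giga_transfers_per_s=6.4,
                       payload_bytes_per_transfer=2,
                       directions=2):
    """Theoretical QPI link bandwidth in GB/s under the assumed parameters."""
    return giga_transfers_per_s * payload_bytes_per_transfer * directions


per_direction = qpi_bandwidth_gb_s(directions=1)
both_directions = qpi_bandwidth_gb_s()

print(f"One direction:   {per_direction:.1f} GB/s")
print(f"Both directions: {both_directions:.1f} GB/s")
```

Note that the headline number counts traffic in both directions at once; each direction on its own gets half.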

Testing Rig and Benchmarks
When testing a system like this, you really need the latest and greatest of everything. Thankfully, Intel knew this, and supplied us with the following, to which we added a power supply and graphics card:

i965 3.2GHz Nehalem CPU (8MB, 133MHz QPI link, 45nm process)
Qimonda 3x1GB DDR3 sticks (1067MHz, CL7)
Intel X25-M 80GB SSD
Intel DX58SO Extreme motherboard
Thermalright Ultra 120 Extreme cooler with LGA1366 mounting bracket


Apart from being the kind of tech that you’d do unscrupulous things just to get a look at, we finally had a beefy test rig ready to rip through our benchmark suite like no other CPU has before – even heavily overclocked ones. All these tests, unless mentioned, were run at stock settings with Turbo Mode turned off.

Our first benchmark was Cinebench R10, which can be run in either single-threaded or multithreaded mode. In a single thread, we got a score of 4,252, about a hundred more than an E8600 at stock. When we ran the benchmark in multithreaded mode, we got a score of 18,321! That is a speed-up of 4.31x over a single thread on a four-core chip – Hyper-Threading was actually doing something useful!

The next benchmark was Hexus PiFast, a single-threaded application that finished in 27.56 seconds. This is about half a second faster than an E8600, showing us that these kinds of applications won’t benefit much from Nehalem. Moving on to our third benchmark, wPrime, we recorded a single-threaded time of 37.297 seconds, a 16 per cent performance boost over an E8600 – due to the huge amount of cache available to this CPU, the test is chewed through very quickly. When we enabled multithreading, however, we had the entire test done in only 7.766 seconds, a speed boost of 4.8x over a single thread! Hyper-Threading is really working its magic here, with the highly repetitive calculations split across concurrent threads, keeping the CPU bursting at the seams with data.
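
The scaling figures quoted for Cinebench and wPrime can be checked with a few lines. The scores and times are the ones we recorded above; the function names and the per-core breakdown are just our way of presenting them.

```python
PHYSICAL_CORES = 4  # the i965 is a quad-core chip


def speedup_from_scores(single_score, multi_score):
    """Higher score is better, so speed-up is multi divided by single."""
    return multi_score / single_score


def speedup_from_times(single_seconds, multi_seconds):
    """Lower time is better, so speed-up is single divided by multi."""
    return single_seconds / multi_seconds


cinebench = speedup_from_scores(4252, 18321)
wprime = speedup_from_times(37.297, 7.766)

for name, speedup in (("Cinebench R10", cinebench), ("wPrime", wprime)):
    per_core = speedup / PHYSICAL_CORES
    print(f"{name}: {speedup:.2f}x overall, {per_core:.2f}x per physical core")
```

Anything above 1.00x per physical core is scaling that four cores alone could not deliver, which is where the second thread on each core earns its keep.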

Our final benchmark is Everest, a system information tool that also includes built-in memory benchmarks. We noted a huge memory read bandwidth of 14247MB/s, definitely a sign that the triple-channel memory and integrated memory controller are working wonders – this is much higher than the current dual-channel DDR2. Write speeds were also huge, with the fast DDR3 able to write at 15438MB/s, and latency was only 38.8ns! With the wait for data quite literally halved, Nehalem can keep up with even the most demanding of programs.

So what about overclocking?
With the introduction of a new CPU, there’s always the off-chance that it will be a dud overclocker. One that is so depressingly average that it refuses to be pushed, and simply throws instabilities and errors at you until you’re blue in the face (most likely from the light of the BSOD on your screen).

Thankfully this is not the case with Nehalem. Overclocking is still performed through the BIOS, and is still easy enough that the thousands of people currently playing with their Core 2s will be able to play around with Nehalem in a very similar manner. The hardest barrier that we found was knowing where to add voltage to get the chip stable – once we found this, overclocking was a very pleasant experience.

Even though we know Nehalem doesn’t use the FSB any more, it is still dependent on the frequency of the QPI bus to determine its working speed. So armed with our i965 and the 24x multiplier available, we set about pushing it in ways that Intel would like to pretend never happen. With a stock QPI frequency of 133MHz, this gives us 3192MHz (essentially 3.2GHz). Just like the FSB, all that’s required to overclock Nehalem is to increase this bus speed, done by simply entering a larger number in the BIOS. We bumped this straight up to 150MHz for a test run and found a few instabilities – so the Vcore was raised from a stock of 1.25V to 1.3V, bringing us into Windows at a very comfortable 3.6GHz.

Not content with that small bump, we sat down with this board over the course of a day, and coaxed the highest speed out of this chip that we could – 3.936GHz with a frequency of 164MHz. To get this stable, we had to add voltage to the core (1.4725V), the QPI link (1.3V) and the memory controller (1.28V). This was all done on air cooling, so expect much better results on water or subzero cooling – especially with the large amount of heat generated by all four cores.
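
The clock speeds in this section all come from the same sum: bus frequency times multiplier. The snippet below runs the three bus speeds we tried through the 24x multiplier; the function and variable names are ours.

```python
MULTIPLIER = 24  # multiplier used on the i965 throughout our testing


def core_clock_mhz(bus_mhz, multiplier=MULTIPLIER):
    """Core clock in MHz for a given bus frequency and multiplier."""
    return bus_mhz * multiplier


stock = core_clock_mhz(133)       # stock bus frequency
first_bump = core_clock_mhz(150)  # our initial test run
best = core_clock_mhz(164)        # the highest stable speed on air

gain_percent = (best / stock - 1) * 100

for label, mhz in (("Stock", stock), ("First bump", first_bump), ("Best", best)):
    print(f"{label}: {mhz} MHz")
print(f"Overclock over stock: {gain_percent:.1f}%")
```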

Verdict – Is Nehalem Worth it?
The answer is a resounding yes! With performance increased over the current wave of CPUs by an extremely large margin, and multithreaded performance in excess of what we’ve ever seen from a multicored CPU, Nehalem is an amazing chip. While it will take a while for Nehalem to trickle down into the affordable range (the RRP of the i965 is $AU2600, more or less), the good news is that prices in other Intel chip ranges are already dropping – so even if you don’t get Nehalem straight away you can still grab yourself some great performance thanks to this release.

We, for one, worship our new Nehalem overlords – and you should too.