GigaHertz processors - extracting bang for the buck - Feature #27

By Staff Writers
00:00 Jan 13, 2004

Gigahertz-class processor chips have transformed the gaming scene in a way unimaginable a decade ago. In this month's feature Dr Carlo Kopp explains how to extract the biggest bang for your expended dollar.

As Atomicans appreciate, the emergence of processor chips with clock speeds in excess of a Gigahertz has been a major advance for the gaming community. There is no doubt that chips with multiple GHz clocks provide an unprecedented gain in the performance potential of consumer systems, and thus gaming applications.

That said, actual computer performance is distinct from performance potential, as poor design and implementation of applications, operating systems and motherboards may see even Gigahertz-class clock speed processors yield little return in achieved performance against their siblings of a half decade ago.

In this month's issue, Atomic will explore what is stopping us from getting the full performance potential of this generation of processors, and discuss what strategies a gamer can pursue to extract as much bang as possible from the available bucks.

Birth of the GigaHertz processors
The biggest breakthrough contributing to the breaking of the 1GHz clock speed barrier was the introduction, in 2000/2001, of Copper metallisation fab technology. After many years of research IBM cracked the problem of how to replace Aluminium, the mainstay until then of on-chip wiring. More conductive Copper reduces series resistance in the wiring on the chip, in turn reducing signal delay effects.

The first processor to use this technology was AMD's K7/Athlon, soon followed by the Pentium 4. We have seen most of the mainstream chip manufacturers shift to Copper since then.

A microprocessor in this class will have tens of millions of transistors on chip, clock speeds between 1.7 and 3.05 GHz (this year) and a six to nine-way superscalar architecture incorporating capabilities such as speculative execution and out-of-order execution. More than likely a four to eight-way set associative Level 2 cache of up to 512KB will be integrated on the chip die, as well as a split Level 1 cache of up to 128KB.

The machine is likely to be using a 64-bit or wider system bus to main memory running at a clock speed between 100 and 533 MHz, depending on the chipset in use.

By any measure, such machines have formidable performance potential if properly exploited.
As performance scales in part with clock speed, but also with the degree of superscalarity in the CPU and the hit ratio of the CPU caches, for many applications such chips will yield performance gains greater than the ratio of their clock speed to the clock speed of a sub 1 GHz chip of similar architecture.

Superficially, we might assume that if application X running on operating system Y delivers Z amount of performance, going from a 700 MHz CPU to a 2.1 GHz CPU should either cut the achievable response time in interactive work to a third, or triple the speed/workload of non-interactive work.

Careful examination of published benchmarks suggests otherwise, with the scaling in benchmark figures with clock speed ratios falling short of the ideal N-fold improvement in performance.

To an observer not well versed in machine architecture, this incongruity might appear puzzling. To understand why it arises, we must delve a little deeper.

Impediments to performance
Let us assume an ideal world - a dangerous but useful form of speculation. In this world the application and operating system are always resident in the CPU's internal on-chip caches, and the binary executable code created by the compiler is dominated by instructions that are not mutually dependent.

In this ideal world, the processor will chew through the stream of instructions in the application at its peak achievable throughput almost all of the time. Of the N execution units inside the superscalar CPU, nearly all of the N will be active all of the time.

As the application is always resident in the cache, in this ideal model every instruction fetch sees the nanosecond class access time of the cache, rather than the tens of nanosecond access times of the main memory. Therefore, the CPU sees an uninterrupted stream of instructions and is never stalled for want of instructions to process.
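The gap between the two access times matters more as clocks get faster, because a wait measured in nanoseconds costs more cycles on a faster chip. The sketch below is illustrative only: the 50 ns main memory figure is an assumed value standing in for "tens of nanoseconds", and `stall_cycles` is a hypothetical helper, not a measurement of any real system.

```python
def stall_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Clock cycles that pass while the CPU waits latency_ns for memory."""
    # One cycle lasts 1 / clock_ghz nanoseconds, so cycles = ns * GHz.
    return latency_ns * clock_ghz


MAIN_MEMORY_NS = 50.0  # assumed value for "tens of nanoseconds"

for clock_ghz in (0.7, 2.1):
    cycles = stall_cycles(MAIN_MEMORY_NS, clock_ghz)
    print(f"{clock_ghz:.1f} GHz: one main memory access costs about {cycles:.0f} cycles")
```

Under these assumed numbers the same memory access costs three times as many cycles on the 2.1 GHz part as on the 700 MHz part, so every fetch that misses the cache gives back part of the clock speed gain.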

Reality, however, might be very different. Superscalar architectures (e.g. P4, Athlon, PowerPC) can execute at peak output only when the code they are executing contains few instructions with mutual dependency. Whenever an instruction is dependent upon the results of a previous instruction, there is potential for performance to be lost. Computer architects use the term 'Instruction Level Parallelism' or ILP to describe the property of executable code whereby little mutual dependency exists between instructions.

ABOVE: This plot is for a simulated machine architecture, which uses a progressively larger Harvard L1 instruction cache.

A particularly troublesome situation for superscalar processors is conditional branching, as waiting for the outcome of a branch has the potential to empty the internal pipelines in the CPU, incurring significant latency times to refill.

Modern processors contain numerous design features aimed at exploiting ILP and also avoiding stalls.

Speculative execution is perhaps the most popular technique in modern processors. At the cost of considerable complexity in logic, supportable due to the large transistor counts available, the CPU will prefetch and execute both outcomes of the branch instruction and discard the outcome that is not used. The difficulty with speculative execution is that the problem becomes intractable where consecutive branches are found.

Consider a piece of code in which four or five branches are nested. The first branch statement results in two paths to speculatively execute. Each consecutive branch doubles the number of instruction streams, from two to four, four to eight and so on. The logic handling speculative execution must prefetch instructions for each of these streams. In practice this imposes limits on how far the CPU can `look ahead' into the code and speculatively execute.

Once that limit is hit, the CPU stalls waiting for the resolution to the pending branch operation.
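The doubling described above is easy to tabulate. This is a minimal sketch of the arithmetic only, assuming every pending branch forks both ways as in the text; `speculative_streams` is a hypothetical name and nothing here models a real CPU's prediction logic.

```python
def speculative_streams(pending_branches: int) -> int:
    """Instruction streams to follow if both outcomes of every
    unresolved branch are executed speculatively."""
    # Each unresolved branch doubles the number of possible paths.
    return 2 ** pending_branches


for depth in range(1, 6):
    print(f"{depth} nested branches -> {speculative_streams(depth)} streams")
```

Five nested branches already mean 32 streams to prefetch and execute, which is why the look-ahead depth has to be capped in hardware.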

The difficulty with the ILP problem is that mutual dependency between operands in instructions is often implicit within the algorithm being used. Therefore no amount of compiler optimisation or other clever trickery can beat the problem.

Within any `soup' of instructions found in a binary module, there will be threads of instructions with mutual dependency, and these in effect form critical timing paths for the algorithms being used. The problem is fundamentally the same as the critical path problem seen in any PERT chart.
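The PERT analogy can be made concrete with a small sketch. The six-instruction dependency table below is invented for illustration, and the model assumes one cycle per instruction and unlimited execution units, which no real processor offers.

```python
from functools import lru_cache

# Hypothetical instruction stream: each entry lists the earlier
# instructions whose results it needs before it can execute.
DEPENDS_ON = {
    "i1": [],
    "i2": [],
    "i3": ["i1", "i2"],
    "i4": ["i3"],
    "i5": [],
    "i6": ["i4", "i5"],
}


def critical_path_length(depends_on):
    """Length, in instructions, of the longest dependency chain."""

    @lru_cache(maxsize=None)
    def depth(instr):
        # An instruction can start only after its deepest input completes.
        return 1 + max((depth(d) for d in depends_on[instr]), default=0)

    return max(depth(instr) for instr in depends_on)


path = critical_path_length(DEPENDS_ON)
print(f"{len(DEPENDS_ON)} instructions, critical path of {path}")
print(f"best-case instructions per cycle: {len(DEPENDS_ON) / path:.2f}")
```

In this toy case the chain i1, i3, i4, i6 sets a floor of four cycles, so even a machine with unlimited execution units averages only 1.5 instructions per cycle.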

The issue of cache performance is no less daunting with a very fast CPU. The aim of all CPU caches, be they combined data/instruction caches, or Harvard architecture / split caches with dedicated data and instruction paths, is to hide the poor performance of the main memory. If the instruction or data is resident in the cache, it can be accessed within a clock cycle or two. If not, the CPU has to wait for the data or instruction to be found in the main memory.

The classical metric of cache performance is the `hit ratio', or the ratio of cache accesses that find the data or instruction in the cache to the total number of accesses.
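As a rough illustration of why the hit ratio matters, the sketch below computes a hit ratio and the average access time it implies. The 1 ns cache and 50 ns main memory timings are assumed values, the function names are hypothetical, and the weighted-average formula is a common textbook simplification rather than a description of any particular chip.

```python
def hit_ratio(hits: int, accesses: int) -> float:
    """Fraction of accesses satisfied by the cache."""
    return hits / accesses


def average_access_ns(ratio: float, cache_ns: float, memory_ns: float) -> float:
    """Weighted average of cache and main memory access times."""
    # Hits pay the cache time, misses pay the main memory time.
    return ratio * cache_ns + (1.0 - ratio) * memory_ns


CACHE_NS = 1.0    # assumed cache access time
MEMORY_NS = 50.0  # assumed main memory access time

for hits in (900, 950, 990):
    ratio = hit_ratio(hits, 1000)
    avg = average_access_ns(ratio, CACHE_NS, MEMORY_NS)
    print(f"hit ratio {ratio:.0%}: average access {avg:.2f} ns")
```

With these assumed timings, lifting the hit ratio from 90% to 99% cuts the average access time from 5.90 ns to 1.49 ns, so a few percentage points of hit ratio carry a large performance effect.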

A cache design that matches a program well will deliver hit ratios well above 90%.
Machine architects like to use a simpl
