CPU vs. GPU and the Future of Processor Architecture - Head to Head #28

By Staff Writers
00:00 Jan 7, 2004

We really appreciate all the work our superscalar CPUs do. Actually, we're eternally grateful. So when James Wang let us in on the secret that GPUs flog themselves just as much, we were more joyful than soulful. Literally. We're in negotiations with Satan right now to get our essences back.

Putting the CPU and GPU side by side to compare their internals, I/O and design philosophy is a serious undertaking. Thanks to the latest class of DirectX 9-compliant GPUs, it's also fascinating. Before GPUs could branch, loop, or handle floating-point data, a comparison between the CPU and GPU would have made little sense. It's a fairly heavy topic, but not insurmountable. If at any time the technicalities get too chunky, just skim through. The most interesting and exciting aspects are covered in later parts, when we investigate the future of both processors.

A simple concept: pipelining
Here's an experiment. Try using your brain to compute two additions simultaneously. You'll find that calculating '4 + 9' and '7 + 8' at once in your head is impossible. What you actually do is add up four and nine, and remember the result 13. Then you add up seven and eight and remember the result 15 -- sequentially. In other words, for arithmetic, our brain is a single-pipeline processor. It can compute one thing after another, but not two things at once.

A processor's pipeline (in both CPUs and GPUs) has several stages. The first stage may be a fetch unit that grabs instructions, followed by an arithmetic logic unit (ALU) and a final stage that writes to memory. Every stage is driven by the same clock signal, so on each cycle an instruction moves one stage further along and, stage by stage (like our brain), the final result is produced. While one instruction sits in the ALU, the fetch unit is already grabbing the next one, so the more stages a pipeline has, the more instructions are in flight at once. To put this in context, the Pentium 4 has a 20-stage pipeline. In GPUs, each stage in a pipeline has a very specific role. This makes it possible to have vertices transformed and lit, then rendered, in a single sweep.

By now you can see that pipelining works because every additional calculating block runs off the same clock signal. This makes adding more elements an obvious way to increase performance. Deeper pipelines also help processors scale to higher clock speeds, which is a key design feature of the Pentium 4's 20-stage pipeline.
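To put rough numbers on the idea, here is a minimal sketch of an idealised pipeline. It assumes every stage takes exactly one cycle and that nothing ever stalls, which no real processor manages. The function names are ours, made up for illustration.

```python
def unpipelined_cycles(instructions: int, stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return instructions * stages


def pipelined_cycles(instructions: int, stages: int) -> int:
    """The first instruction needs `stages` cycles to fill the pipe;
    after that, one instruction completes every cycle (idealised, no stalls)."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


if __name__ == "__main__":
    n = 1000  # arbitrary instruction count, chosen for illustration
    for stages in (3, 20):
        slow = unpipelined_cycles(n, stages)
        fast = pipelined_cycles(n, stages)
        print(f"{stages:2d} stages: {slow} cycles one at a time, "
              f"{fast} cycles pipelined ({slow / fast:.1f}x)")
```

Under these assumptions, 1000 instructions through a 20-stage pipe take 1019 cycles instead of 20000. The sketch counts cycles only; the higher clock speed that a deeper pipeline allows is a separate gain it does not model.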

It doesn't take a genius to take pipelining to the next level by having multiple pipes. This concept is called parallel pipelining, and is at the heart of a GPU's extreme fillrate. In CPUs, this design (having multiple, parallel arithmetic pipelines in the execution engine) is called 'superscalar' execution (Diagram 1). In other words, superscalar processors can do more than one math operation at the same time, whereas a single pipeline has to get one operation under way before it can start the next. When you combine a pipeline consisting of multiple stages with a superscalar part that places several ALUs side by side in the execution stage, you get our modern-day CPU.
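As a rough illustration, the sketch below replays the two sums from the brain experiment on a toy in-order machine that can start up to `width` operations per cycle. Everything here is a simplification of our own: the `Instr` type and `issue_cycles` function are hypothetical, only read-after-write dependencies are checked, and each operation is assumed to take a single cycle.

```python
from typing import NamedTuple, Sequence


class Instr(NamedTuple):
    dest: str  # name of the register that receives the result
    src1: str  # first operand (register name or literal)
    src2: str  # second operand (register name or literal)


def issue_cycles(program: Sequence[Instr], width: int) -> int:
    """Count the cycles needed to issue `program` in order, starting at most
    `width` instructions per cycle. An instruction cannot start in the same
    cycle as an earlier instruction whose result it reads."""
    cycles = 0
    i = 0
    while i < len(program):
        written = set()  # results produced during this cycle
        issued = 0
        while i < len(program) and issued < width:
            ins = program[i]
            if ins.src1 in written or ins.src2 in written:
                break  # depends on a result from this cycle, so wait
            written.add(ins.dest)
            issued += 1
            i += 1
        cycles += 1
    return cycles


# a = 4 + 9 and b = 7 + 8 are independent; c = a + b needs both results.
PROGRAM = [Instr("a", "4", "9"), Instr("b", "7", "8"), Instr("c", "a", "b")]

if __name__ == "__main__":
    for width in (1, 2, 3):
        print(f"width {width}: {issue_cycles(PROGRAM, width)} cycles")
```

With one pipe the three additions take three cycles. With two pipes the independent sums go together and the total drops to two. A third pipe buys nothing here, because the last addition has to wait for the first two regardless.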

Memory Hierarchy
One thing that doesn't differ between CPUs and GPUs is their data source -- they both access the main memory. The entire memory hierarchy is illustrated in Diagram 2. You are probably familiar with all the stages, but if you haven't had any programming experience, the word 'register' may sound a little vague. Registers are important because they hold data in exactly the size that the CPU can work on. If two numbers are to be added, they must be stored in separate registers before the CPU can add them. As most operations boil down to arithmetic on a handful of inputs, it is important that the CPU can access these quickly.

Conceptually, think of Homer Simpson. To keep this food processing unit (FPU?) happily fed, the last stage of caching must be within 'reaching distance' of his couch. A CPU is like Homer on a lazy day; it can only work on things within grabbing distance and relies on memory controllers (Marge) to shuffle data (food) back and forth between the HDD (supermarket), memory (kitchen) and L1/L2 caches (tables).
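To see why the tables next to the couch matter so much, here is a back-of-the-envelope sketch of average access time through a cache hierarchy. The latencies and hit rates are invented for illustration, not measurements of any real chip, and `average_access_time` is our own hypothetical helper.

```python
from typing import Sequence, Tuple


def average_access_time(levels: Sequence[Tuple[float, float]],
                        memory_latency: float) -> float:
    """Expected cycles per access.

    `levels` lists (latency_in_cycles, hit_rate) for each cache, fastest
    first. An access that misses every cache pays `memory_latency` on top.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get this far down the hierarchy
    for latency, hit_rate in levels:
        total += reach * latency   # everything reaching this level pays its latency
        reach *= 1.0 - hit_rate    # only the misses carry on to the next level
    return total + reach * memory_latency


if __name__ == "__main__":
    # Assumed figures: L1 = 2 cycles with 95% hits, L2 = 10 cycles with 90% hits,
    # main memory = 200 cycles.
    with_caches = average_access_time([(2, 0.95), (10, 0.90)], 200)
    without_caches = average_access_time([], 200)
    print(f"with caches:    {with_caches:.1f} cycles per access")
    print(f"without caches: {without_caches:.1f} cycles per access")
```

With these made-up numbers the average access costs a few cycles rather than a couple of hundred, which is the whole point of keeping the snacks within reach.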

CPUs and GPUs are really just collections of processing elements, whose arrangement is known as the microarchitecture. However, this is just half the story. Today, CPUs are clocked so high that it's not just the raw clock speed that matters, but making sure the superscalar beasts are kept fed with data. This is why a memory hierarchy using multiple caches is employed. Even with full-speed caches, the time it takes to fetch instructions from memory (latency) is still unacceptable for good performance. This is where branch prediction comes into play. For example, a certain AI program may use the following logic: if there is a human player in the game, execute and repeat the 'attack_human' function until the player is dead. This is a case where branch prediction can allow the CPU to start executing 'attack_human' before the condition 'is there a player' has been evaluated. Considering that most of the time you will be alive, this pre-fetching mechanism will save a good deal of latency -- the same amount needed to evaluate the condition. Due to the nature of most computing problems and the advancement of prediction algorithms, you will find that most branch predictions (more than 90%) are correct, which makes prediction a vital part of CPU performance today.
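For a feel of how a predictor behaves on the 'attack_human' loop, here is a minimal sketch of a two-bit saturating counter, a textbook scheme. We are not claiming any particular CPU uses exactly this. The 99 loop iterations and the starting state are assumptions chosen for illustration.

```python
from typing import Sequence


def simulate_two_bit_predictor(outcomes: Sequence[bool]) -> float:
    """Return the fraction of branches predicted correctly.

    The counter runs from 0 to 3: states 0 and 1 predict 'not taken',
    states 2 and 3 predict 'taken'. It moves one step towards each
    actual outcome, so a single surprise does not flip the prediction
    once the counter has saturated.
    """
    if not outcomes:
        raise ValueError("need at least one branch outcome")
    state = 2  # assumed starting point: weakly 'taken'
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)


if __name__ == "__main__":
    # The player survives 99 passes through the loop (branch taken),
    # then dies and the loop exits (branch not taken).
    attack_human_loop = [True] * 99 + [False]
    accuracy = simulate_two_bit_predictor(attack_human_loop)
    print(f"prediction accuracy: {accuracy:.0%}")
```

In this run the only miss is the final iteration, when the player dies. A branch that flips on every pass would fare far worse with the same counter, which is why real predictors are considerably more elaborate.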

GPUs, having just developed the ability to process conditional branches and loops (DirectX 9), have yet to incorporate any prediction logic.

CPU vs. GPU: Input and output
The input side is an exceptionally important element, as keeping both processors fully fed at all times is the only way to reach their theoretical performance numbers.

The CPU mostly conducts short math operations. Unlike on the GPU, data and instructions don't need to go through an ultra-complex process to produce stunningly abstract results. Most of the time is spent handling Windows tasks. Due to the relative simplicity of these tasks, instructions spend a very short amount of time on the CPU and are more or less shuffled back and forth between it and memory. There are of course exceptions to this, such as data-centric scientific computing and repetitive tasks such as MPEG encoding and Photoshop filters. This is where the FPU and SIMD features of the CPU kick in, which are designed to handle repetitive math operations. Due to this fast-processing, heavy-shuffling nature, the bandwidth and latency of the memory subsystem are the most important concerns for a powerful CPU. With Rambus leading memory performance these days, a CPU being fed by top-of-the-range 600MHz DDR-RDRAM receives a luscious 4.8GB/s of bandwidth to main memory.

By comparison, using the latest AGP 8x interconnect, a GPU has only 2.1GB/s to main memory. Even next to a more modest memory configuration such as PC2700, the GPU's link still pales.
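The headline figures fall out of simple arithmetic: bus width times transfers per second. The sketch below reproduces them under assumptions that are ours rather than the article's. The RDRAM setup is taken to be two 16-bit channels at 1200 million transfers per second (600MHz, double data rate), AGP 8x to be a 32-bit bus making eight transfers per 66.67MHz clock, and PC2700 to be a single 64-bit channel at 333 million transfers per second.

```python
def peak_bandwidth_gb_s(bus_width_bits: int,
                        transfers_per_second: float,
                        channels: int = 1) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second * channels / 1e9


if __name__ == "__main__":
    # All configurations below are assumed, as described in the text above.
    rdram = peak_bandwidth_gb_s(16, 600e6 * 2, channels=2)  # dual-channel RDRAM
    agp_8x = peak_bandwidth_gb_s(32, 66.67e6 * 8)           # AGP 8x
    pc2700 = peak_bandwidth_gb_s(64, 333.33e6)              # DDR333
    print(f"RDRAM:  {rdram:.1f} GB/s")
    print(f"AGP 8x: {agp_8x:.1f} GB/s")
    print(f"PC2700: {pc2700:.1f} GB/s")
```

These are theoretical peaks. Real-world throughput is lower once latency and protocol overhead enter the picture.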
 
On the next level of the memory hierarchy is the L1/L2 cache. In this area, the CPU is set on a path that will greatly outrun the GPU. This isn't to say that it's better, but rather that the CPU needs it more than the GPU does. Modern CPUs spend a large fraction of their transistors on cache.

This is especially true for the bigger chips out there. This trend of more cache and less logic is set to continue, until the actual 'processing elements' are merely tiny blocks in a huge sea of cache. A Pentium 4 Northwood has 512KB of L2 cache and 8KB of L1. The follow-up Prescott is said to double this to 1MB of L2 and 16KB of L1. The cache budget is increasing because of the diminishing returns from adding more pipeline elements and the growing reliance on low-latency storage.

A GPU, on the other hand, has two main caches: a vertex cache and a pixel cache. Their sizes are never disclosed by ATI and NVIDIA, but are unlikely to be greater than 1MB.

Future CPU caches will greatly outstrip their GPU equivalents due to the different design philosophy. GPUs need the extra transistor budget to do more maths and can...

 
 