
Here's my quick take.

A top of the line Zen core is a powerful CPU with wide SIMD (AVX-512 is 16 lanes of 32 bit quantities), significant superscalar parallelism (capable of issuing approximately 4 SIMD operations per clock), and a high clock rate (over 5GHz). There isn't a lot of confusion about what constitutes a "core," though multithreading can inflate the "thread" count. See [1] for a detailed analysis of the Zen 5 line.

A single Granite Ridge core has peak 32 bit multiply-add performance of about 730 GFLOPS.

Nvidia, by contrast, uses the marketing term "core" to refer to a single SIMD lane. Their GPUs are organized as 32 SIMD lanes grouped into each "warp," and 4 warps grouped into a Streaming Multiprocessor (SM). CPU and GPU architectures can't be directly compared, but just going by peak floating point performance, the most comparable granularity to a CPU core is the SM. A warp is in some ways more powerful than a CPU core (generally wider SIMD, larger register file, more local SRAM, better latency hiding) but in other ways less (much less superscalar parallelism, lower clock, around 2.5GHz). A 4090 has 128 SMs, which is a lot and goes a long way to explaining why a GPU has so much throughput. A 1080, by contrast, has 20 SMs - still a goodly number but not mind-meltingly bigger than a high end CPU. See the Nvidia Ada whitepaper [2] for an extremely detailed breakdown of 4090 specs (among other things).

A single Nvidia 4090 "core" has peak 32 bit multiply-add performance of about 5 GFLOPS, while an SM has 640 GFLOPS.
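Back-of-the-envelope, as a quick Python sketch (nothing authoritative, just the figures above: 32 lanes per warp, 4 warps per SM, a ~2.5 GHz clock, and an FMA counted as 2 flops):

    lanes_per_warp = 32
    warps_per_sm = 4
    clock_ghz = 2.5                 # rough GPU clock from above
    flops_per_lane_per_clock = 2    # one fused multiply-add = 2 flops

    lanes_per_sm = lanes_per_warp * warps_per_sm             # 128 "cores" per SM
    gflops_per_core = flops_per_lane_per_clock * clock_ghz   # ~5 GFLOPS per lane/"core"
    gflops_per_sm = lanes_per_sm * gflops_per_core           # ~640 GFLOPS per SM
    print(gflops_per_core, gflops_per_sm)                    # 5.0 640.0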

I don't know anybody who counts tensor cores by core count, as the capacity of a "core" varies pretty widely by generation. It's almost certainly best just to compare TFLOPS - also a bit of a slippery concept, as that depends on the precision and also whether the application can make use of the sparsity feature.

I'll also note that not all GPU vendors follow Nvidia's lead in counting individual SIMD lanes as "cores." Apple Silicon, by contrast, uses "core" to refer to a grouping of 128 SIMD lanes, similar to an Nvidia SM. A top of the line M2 Ultra contains 76 such cores, for 9728 SIMD lanes. I found Philip Turner's Metal benchmarks [3] useful for understanding the quantitative similarities and differences between Apple, AMD, and Nvidia GPUs.

[1]: http://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardo...

[2]: https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/n...

[3]: https://github.com/philipturner/metal-benchmarks



An x64 core roughly corresponding to an SM, or in the amdgpu world a compute unit (CU), seems right. It's in the same ballpark for power consumption, and it's the component that handles an instruction pointer, a local register file, and so forth.

A really big CPU is a couple of hundred cores; a big GPU is a few hundred SMs/CUs. Some low-power chips have 8 x64 cores and 8 CUs on the same package. It all roughly lines up.


If SIMD lanes come to vastly dominate the composition of a typical computer chip (in terms, e.g., of where power is consumed) will the distinction between CPU/GPU continue to be meaningful?

For decades the GPU was "special purpose" hardware dedicated to the math of screen graphics. If the type of larger-scale numerical computation that has been popularised by LLMs is now deemed "typical use", then the distinction may be becoming irrelevant (and even counterproductive from a software development perspective).


Xeon Phi was that, a bunch of little cores with a ton of SIMD lanes each.

It didn’t really work out, in part because it was too far from a regular old Xeon to run your normal code well without optimizing. On the other side, Intel couldn’t keep up with NVIDIA on the sort of metrics people care about for these compute accelerators: memory bandwidth mostly. If you are going to have to refactor your whole project anyway to use a compute accelerator, you probably want a pretty big reward. It isn’t obvious (to me at least) whether this was because the Phi cores, simple as they were, were still a lot more complex than a GPU “core” (maybe the design just had various hidden bottlenecks that were too hard to work out due to that complexity), or because Intel just wasn’t executing very well at the time, especially compared to NVIDIA (it was Intel’s dark age vs NVIDIA’s golden age, really). The programmer’s “logical or” joke is a possible answer here.

But, you can’t do everything in parallel. It is a shame the Phi didn’t survive into the age where Intel is also doing big/little cores (in a single chip). A big Xeon core (for latency) surrounded by a bunch of little Phi cores (for throughput) could have been a really interesting device.


The special purpose graphics distinction is already mostly irrelevant and has been for 10 or 20 years for anyone doing High Performance Computing (HPC) or AI. It predates LLMs. For a while we had the acronym GPGPU - General Purpose computing on Graphics Processing Units [1]. But even that is now an anachronism, it started dying in 2007 when CUDA was released. With CUDA and OpenCL and compute shaders all being standard, it is now widely understood that today’s GPUs are used for general purpose compute and might not do any graphics. The bulk of chip area is general purpose and has been for some time. From a software development perspective GPU is just a legacy name but is not causing productivity problems or confusion.

To be fair, yes, most GPUs still do come with things like texture units, video transcode units, ray tracing cores, and a framebuffer and video output. But that’s already changing: you have, for example, some GPUs with ray tracing, and others, designed more for data centers, without it. And you don’t have to use the graphics functionality; for GPU supercomputers it’s common for the majority of GPU nodes to be compute-only.

In the meantime we now have CPUs with embedded GPUs (aka iGPUs), GPUs with embedded CPUs, GPUs that come paired with CPUs and a wide interconnect (like Nvidia Grace Hopper), combined CPU-GPU chips (like Apple M1), and yes, CPUs in general have more and more SIMD.

It’s useful to have a name or a way to distinguish between a processor that mostly uses a single threaded SISD programming model and has a small handful of hardware threads, versus a processor that uses a SIMD/SIMT model and has tens of thousands of threads. That might be mainly a question of workloads and algorithms, but the old line between CPU and GPU is very blurry, headed towards extinction, and the “graphics” part has already lost meaning.

[1] https://en.wikipedia.org/wiki/General-purpose_computing_on_g...


The display controller, which handles the frame buffers and the video outputs, and the video decoding/encoding unit are two blocks that are usually well separated from the remainder of the GPU.

In many systems-on-a-chip, the three blocks (the GPU in the strict sense, the video decoder/encoder, and the display controller) may even be licensed from different IP vendors and then combined in a single chip. Also in CPUs with integrated GPUs, like Intel Lunar Lake and AMD Strix Point, these three blocks can be found in well-separated locations on the silicon die.

The graphics-specific functions that do belong in the GPU proper, because they perform operations that are mixed with the general-purpose computations done by the shaders, are the ray-tracing units, the texture units, and the rasterization units.


The x64 cores putting more hardware into the vector units, and amdgpu changing from 64-wide to 32-wide SIMD (at least for some chips), looks like convergent evolution to me. My personal belief is that the speculation-and-pipelining approach is worse than having many tasks and swapping between them.

I think the APU designs from AMD are the transition pointing to the future. The GPU cores will gain increasing access to the raw hardware and the user interface until the CPU cores are optional and ultimately discarded.


There is little relationship between the reasons that determine the width of SIMD in CPUs and GPUs, so there is no convergence between them.

In the Intel/AMD CPUs, the 512-bit width, i.e. 64 bytes or 16 FP32 numbers, matches the width of the cache line and the width of a DRAM burst transfer, which simplifies the writing of optimized programs. This SIMD width also provides a good ratio between the power consumed in the execution units and the power wasted in the control part of the CPU (around 80% of the total power consumption goes to the execution units, which is much more than when using narrower SIMD instructions).

Increasing the SIMD width more than that in CPUs would complicate the interaction with the cache memories and with the main memory, while providing only a negligible improvement in the energy efficiency, so there is no reason to do this. At least in the following decade it is very unlikely that any CPU would increase the SIMD width beyond 16 FP32 numbers per operation.

On the other hand, the AMD GPUs before RDNA had a SIMD width of 64 FP32 numbers, but the operations were pipelined and executed in 4 clock cycles, so only 16 FP32 numbers were processed per clock cycle.

RDNA has doubled the width of the SIMD execution, processing 32 FP32 numbers per clock cycle. For this, SIMD instructions with a reduced width of 32 FP32 have been introduced, but they are executed in one clock cycle versus the old 64 FP32 instructions that were executed in four clock cycles. For backwards compatibility, RDNA has kept 64 FP32 instructions, which are executed in two clock cycles, but these were not recommended for new programs.

RDNA 3 has changed all this again, because now the 64 FP32 instructions can sometimes be executed in a single clock cycle, so they may again be preferable to the 32 FP32 instructions. However, it is also possible to take advantage of the increased width of the RDNA 3 SIMD execution units when using 32 FP32 instructions, if certain new instructions that encode dual operations are used.

So the AMD GPUs have continuously evolved towards wider SIMD execution units, from 16 FP32 before RDNA, to 32 FP32 in RDNA and finally to 64 FP32 in RDNA 3.

The distance from CPUs has been steadily increasing, there is no convergence.
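To put that progression in one place, a small Python sketch using just the numbers above (nominal SIMD instruction width and clock cycles per instruction, giving FP32 lanes processed per clock per SIMD unit; the generation labels are mine):

    generations = {
        # name: (SIMD instruction width in FP32, clock cycles per instruction)
        "pre-RDNA": (64, 4),   # 64-wide ops pipelined over 4 cycles
        "RDNA":     (32, 1),   # native 32-wide ops, single cycle
        "RDNA 3":   (64, 1),   # 64-wide ops can sometimes issue in one cycle
    }
    for name, (width, cycles) in generations.items():
        print(name, width // cycles, "FP32 lanes per clock")
    # pre-RDNA 16, RDNA 32, RDNA 3 64 -- versus a fixed 16 on the CPU side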


There are still a lot of differences, even if you put a lot more SIMD lanes into the CPU. CPUs keep their execution resources fed by aggressive caching, prefetching, and out-of-order execution, while GPUs rely on having lots of threads around so that if one stalls, another is able to execute.


Hi Raph, first of all thank you for all of your contributions and writings - I've learned a ton from reading your blog!

A minor quibble amidst your good comparison above ;)

For a Zen 5 core, we have 16-wide SIMD with 4 pipes: 2 are FMA (2 flops each) and 2 are FADD, at ~5 GHz. I math that out to 16 * 6 * 5 = 480 GFLOPS/core... am I missing something?


According to the initial reviews, it appears that when 512-bit instructions are executed at the maximum rate, the power consumption increases enough that the clock frequency drops to around 4 GHz for a 9950X.

So a 9950X can do 256 FMA + 256 FADD for FP64 or 512 FMA + 512 FADD for FP32, per clock cycle.

Using FP32, because it can be compared with the GPUs, there are 1536 Flop per clock cycle, therefore about 6 FP32 Tflop/s @ 4 GHz for a 9950X (around 375 FP32 Gflop/s per core, but this number is irrelevant, because a single active core would go to a much higher clock frequency, probably over 5 GHz). For an application that uses only FMA, like matrix multiplication, the throughput would drop to around 4 FP32 Tflop/s or 2 FP64 Tflop/s.
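The same arithmetic as a quick Python sketch (FMA counted as 2 flops; the ~4 GHz all-core AVX-512 clock and the per-clock instruction counts are the figures above):

    clock_ghz = 4.0               # all-core clock under sustained 512-bit work
    fp32_fma_per_clock = 512      # whole 16-core chip, per clock cycle
    fp32_fadd_per_clock = 512

    flops_per_clock = fp32_fma_per_clock * 2 + fp32_fadd_per_clock   # 1536
    tflops_mixed = flops_per_clock * clock_ghz / 1000                # ~6.1 FP32 Tflop/s
    tflops_fma_only = fp32_fma_per_clock * 2 * clock_ghz / 1000      # ~4.1 Tflop/s (e.g. matmul)
    print(tflops_mixed, tflops_fma_only)                             # 6.144 4.096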

The values for the FP32 throughput are similar to those of the best integrated GPUs that exist at this time. Therefore doing graphics rendering on the CPU of a 9950X might be about as fast as doing it on the iGPU of the best mobile CPUs. Rendering on a 9950X can still leverage the graphics- and video-specific blocks contained in the anemic GPU included in the 9950X, whose only problem is that it has a very small number of compute shaders; their functions can be augmented by the strong CPU.


Thanks for the kind words and the clarification. I'm sure you're right; I was just multiplying things together without taking into account the different capabilities of the different execution units. Hopefully that doesn't invalidate the major points I was making.


For those of us not fluent in codenames:

Granite Ridge core = Zen 5 core.


> It's almost certainly best just to compare TFLOPS

Depends on what you're comparing with what, and the context, of course.

Casey is doing education, so that people learn how best to program these devices. A mere comparison of TFLOPS of CPU vs GPU would be useless towards those ends. Similarly, just a bare comparison of TFLOPS between different GPUs even of the same generation would mask architectural differences in how to in practice achieve those theoretical TFLOPS upper bounds.

I think Casey believes most people don't know how to program well for these devices/architectures. In that context, I think it's appropriate to be almost dismissive of TFLOPS comparison talk.


> Depends on what you're comparing with what, and the context, of course.

Agreed

It's the classic question of whether a problem is "embarrassingly parallel" (e.g. physics calculations, image rendering, LLM image creation or textual content generation) or not.


> It's almost certainly best just to compare TFLOPS - also a bit of a slippery concept, as that depends on the precision

Agreed. Some quibbles about the slipperiness of the concept.

Flops are floating point operations. IMO it should not be confusing at all: just count single-precision floating point operations, which all devices can do and which are explicitly defined in the IEEE standard.

Half precision flops are interesting but should be called out for the non-standard metric they are. Anyone using half precision flops as a flop is either being intentionally misleading or is confused about user expectations.

On the other side, lots of scientific computing folks would rather have doubles, but IMO we should get with the times and learn to deal with less precision. It is fun, you get to make some trade-offs and you can see if your algorithms are really as robust as you expected. A free 2x speed up even on CPUs is pretty nice.
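A tiny numpy sketch of the kind of robustness check that trade-off forces, just to make the failure mode concrete (plain float32 vs float64 rounding, nothing library-specific):

    import numpy as np

    # Absorption: in float32, adding 1 to 1e8 is lost entirely; float64 keeps it.
    x32, x64 = np.float32(1e8), np.float64(1e8)
    print((x32 + np.float32(1.0)) - x32)   # 0.0
    print((x64 + np.float64(1.0)) - x64)   # 1.0

    # Drift: naively accumulating 0.1 a million times in float32 visibly misses
    # the exact answer 100000; float64 is off only in the last digits.
    acc32, acc64 = np.float32(0.0), 0.0
    for _ in range(1_000_000):
        acc32 += np.float32(0.1)
        acc64 += 0.1
    print(acc32, acc64)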

> and also whether the application can make use of the sparsity feature

Eh, I don’t like it. Flops are flops. Avoiding a computation exploiting sparsity is not a flop. If we want to take credit for flops not executed via sparsity, there’s a whole ecosystem of mostly-CPU “sparse matrix” codes to consider. Of course, GPUs have this nice 50% sparse feature, but nobody wants to compete against PARDISO or iterative solvers for really sparse problems, right? Haha.


In domains like ML, people care way more about the half precision FLOPs than single precision.


They don’t have much application outside ML, at least as far as I know. Just call them ML ops, and then they can include things like those funky shared-exponent floating point formats, and/or stuff with ints.

Or they could be measured in bits per second.

Actually I’m pretty interested in figuring out if we can use them for numerical linear algebra stuff, but I think it’d take some doing.



