
.. who is running LLMs on CPU instead of GPU or TPU/NPU?


Actually, that's a really good question; I hadn't considered that the comparison here is just CPU vs. using Metal (CPU+GPU).

To answer the question, though: I think this would be used where you're building an app that wants to use a small AI model while keeping the GPU free for graphics-related work, which I'm guessing is why Apple put these in their hardware in the first place.

Here is an interesting comparison between the two from a whisper.cpp thread. Ignoring startup times, CPU+ANE seems about on par with CPU+GPU: https://github.com/ggml-org/whisper.cpp/pull/566#issuecommen...


It essentially never makes sense to run on the CPU; you will only ever see enthusiasts doing it.

Yes, hammering the GPU too hard can affect the display server, but no, switching to the CPU is not a good alternative.


Not switching to the CPU; switching to the ANE (the Neural Engine cores). If you read the research papers Apple has released, the example I gave is pretty much how it's being used: small image-classification models running on the ANE alongside a graphics app that needs the GPU to be free.
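
For anyone curious what that looks like in practice, here's a minimal sketch using coremltools' compute-unit selection (the compute_units API is real; the model path and input name are placeholders):

    import coremltools as ct

    # Pin the model to CPU + Neural Engine so the GPU stays free for rendering.
    # "ImageClassifier.mlpackage" is a placeholder; substitute your own model.
    model = ct.models.MLModel(
        "ImageClassifier.mlpackage",
        compute_units=ct.ComputeUnit.CPU_AND_NE,
    )

    # Input names and shapes depend on the model; this assumes an "image" input.
    # result = model.predict({"image": pil_image})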


Oh, yes, I misread! It’s great for that


Depends on the size of the model and how much VRAM you have (and how long you're willing to wait).
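
Rough napkin math for the weights alone (KV cache, activations, and runtime overhead all add more on top):

    # params in billions x bits per weight / 8 gives GB directly
    def weights_gb(params_billion, bits_per_weight):
        return params_billion * bits_per_weight / 8

    print(weights_gb(8, 16))   # 8B at fp16   -> 16 GB, fits a 24 GB card
    print(weights_gb(70, 16))  # 70B at fp16  -> 140 GB, no single consumer card
    print(weights_gb(70, 4))   # 70B at 4-bit -> 35 GB, still over a 5090's 32 GB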


Not all of us own GPUs worth using. Now, among people using Macs... maybe if you had a hardware failure?




M3 Ultra has a big GPU with 819 GB/sec bandwidth.

LLM performance is twice as fast as RTX 5090

https://creativestrategies.com/mac-studio-m3-ultra-ai-workst...


> LLM performance is twice as fast as RTX 5090

Your tests are wrong. You used MLX for the Mac Studio (optimized for Apple Silicon) but you didn't use vLLM for the 5090. There's no way a machine with half the bandwidth of a 5090 delivers twice the tok/s.
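
For an apples-to-apples run, something like this on the 5090 side would be fairer (model name and prompt are placeholders; pick a model that fits in 32 GB):

    import time
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
    params = SamplingParams(temperature=0.0, max_tokens=256)

    start = time.perf_counter()
    out = llm.generate(["Explain memory bandwidth in one paragraph."], params)
    elapsed = time.perf_counter() - start

    # tokens generated / wall-clock time = decode throughput
    print(len(out[0].outputs[0].token_ids) / elapsed, "tok/s")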


Unless it's a large model that doesn't fit in the 5090, but that's no longer a $4k Mac Studio, I think.


That's orthogonal to the speed discussion.

Also, the GP was mostly testing models that fit on both the 5090 and the Mac Studio.


$4k will get you a 96 GB Mac Studio with M3 Ultra (819 GB/sec).

That's 3x the RAM of the 5090.


> That's 3x the RAM of the 5090

And a bit less than half the bandwidth (noted for completeness).


Yeah, that's probably wrong. But the M3 Ultra is good enough for local inferencing in any case.


Pretty sure they're using the 80 GPU cores available in that case.


And that still performs worse than entry-level Nvidia gaming cards.

Apple isn't serious about AI and needs to figure out its AI story. Every other big tech company is doing something about it.


They're basically second place behind NVIDIA for model inference performance, and often the only game in town for the average person trying to run larger models that won't fit in the 24 or 32 GB of memory available in top-shelf NVIDIA consumer offerings.

I wouldn't say Apple isn't serious about AI. They had the forethought to build a shared-memory architecture with the insane memory bandwidth needed for these kinds of tasks, while also designing Neural Engine cores specifically for small on-device models.

I'd say Apple is currently ahead of NVIDIA in sheer memory available, which is pretty crucial for training and inference on large models, at least right now. NVIDIA seems to be deliberately limiting the memory in its consumer cards, which I think is pretty short-sighted.


Not true. It performs 20-30% better than an RTX A6000 (I have both), and it has more than 10 times the VRAM. Compared with newer Nvidia cards, benchmarks say it does substantially better than a 5070 Ti, a bit better than a 4080, and a bit worse than a 5080. But once again, it has roughly 30 times the VRAM of those cards, which for AI workloads really are just expensive toys due to their lack of VRAM.


Not for inferencing. M3 Ultra runs big LLMs twice as fast as RTX 5090.

https://creativestrategies.com/mac-studio-m3-ultra-ai-workst...

The RTX 5090 only has 32 GB of RAM; the M3 Ultra has up to 512 GB with 819 GB/sec bandwidth, so it can run models that will not fit on an RTX card.

EDIT: The benchmark may not be properly utilizing the 5090, but the M3 Ultra is still far more capable at LLM inferencing than an entry-level RTX card.
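
If anyone wants to reproduce the Mac side, mlx-lm is the usual route (model name is a placeholder; verbose=True prints the tok/s stats):

    from mlx_lm import load, generate

    # a 4-bit community conversion; swap in whatever model you're comparing
    model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")
    generate(model, tokenizer,
             prompt="Explain memory bandwidth in one paragraph.",
             max_tokens=256, verbose=True)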


My little $599 Mac Mini does inference about 15-20% slower than a 5070 in my kids’ gaming rig. They cost about the same, and I got a free computer.

Nvidia makes an incredible product, but Apple's different market-segmentation strategy might make it a real player in the long run.


It can run models that cannot fit on TEN RTX 5090s. Yes, it can run DeepSeek V3/R1 quantized at 4 bits at an honest 18-19 tok/s, and that's a model you cannot fit into ten 5090s.
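
The napkin math backs this up, assuming the usual bandwidth-bound view of decoding and DeepSeek V3/R1's published 671B-total / ~37B-active MoE shape (this ignores KV cache and routing overhead):

    total, active = 671e9, 37e9   # total vs. active parameters per token
    bytes_per_w = 0.5             # 4-bit quantization = 0.5 bytes/weight

    print(total * bytes_per_w / 1e9)   # ~336 GB of weights: fits in 512 GB,
                                       # not in 10 x 32 GB discrete cards

    bw = 819e9                         # M3 Ultra memory bandwidth, bytes/s
    print(bw / (active * bytes_per_w)) # ~44 tok/s theoretical ceiling, so an
                                       # observed 18-19 tok/s is plausible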


Right, that's the $9500 Mac Studio with 512GB RAM and 80-core GPU.

16x the RAM of RTX 5090.

There are two versions of the M3 Ultra:

28-core CPU, 60-core GPU

32-core CPU, 80-core GPU

Both have a 32-core Neural Engine.


Can we stop with the derisive "fanboy" nonsense? Most people don't say FOSS "fanboy" or Linux "fanboy", but plenty of people here are exactly that. It's a bit insulting to people who like and appreciate Mac hardware; just because you might not like it doesn't mean you have to be so dismissive. And that Mac Studio is a very impressive computer, but it's usually the ones who have never used one that seem to have the most opinions about them.



