
I thought NVMe flash latency was measured in tens of microseconds. 3x RAM would be a fraction of a microsecond, right?


Under ideal conditions, yes. But the 3x difference I see in practice is less about NVMe being just that good, and more about operations against (main) memory getting bottlenecked under high all-cores concurrent access, with no cross-workload memory locality to make the caches useful. It's also about memory accesses only being "in play" while a worker thread isn't context-switched out, whereas PCIe-triggered NVMe DMA can proceed even while the thread has yielded for some other reason.

In other words, when measured end-to-end in the context of a larger work-step (one large enough to be interrupted by a context switch), the mean amortized difference between the two types of fetch comes out at less than 3x.

Top of my wishlist for future architectures is “more, lower-width memory channels” — i.e. increased intra-CPU NUMAification. Maybe something CXL.mem will roughly simulate — kind of a move from circuit-switched memory to packet-switched memory, as it were.


How do you figure these things out? Do you have special software to look at this?


I think the person you're replying to is confusing IOPS with latency. If you add enough parallelism, then NAND flash random-read IOPS will eventually reach DRAM performance.

But it's not going to be easy - for a sense of scale, I just tested a 7950X at stock speeds with stock JEDEC DDR5 timings. I filled an 8GB block of memory with numbers, then, with a deterministic random seed, repeatedly picked random 4kB pages, computing their sum and eventually reporting it (to avoid overly clever dead-code elimination, and to make sure the data is fully read).
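A minimal sketch of that kind of benchmark (my own reconstruction in Python/NumPy, not the actual code used - the buffer size, iteration count, and seed here are placeholders, and interpreter overhead means this sketch only illustrates the method, not the absolute numbers):

```python
# Sketch of a QD1 random-read memory benchmark: fill a buffer, then sum
# randomly chosen 4 KiB pages with a fixed seed, printing the total so
# the reads can't be optimized away. (Reconstruction; not the original.)
import time
import numpy as np

PAGE = 4096                    # SSD-friendly page size, in bytes
BUF = 64 * 1024 * 1024         # 64 MiB here; the original test used 8 GB
ITERS = 100_000

buf = np.arange(BUF // 8, dtype=np.uint64)   # populate the buffer
pages_per_buf = BUF // PAGE
words_per_page = PAGE // 8

rng = np.random.default_rng(42)              # deterministic seed
picks = rng.integers(0, pages_per_buf, size=ITERS)

total = np.uint64(0)
t0 = time.perf_counter()
for p in picks:
    start = int(p) * words_per_page
    # read (sum) one whole 4 KiB page per "I/O"
    total += buf[start:start + words_per_page].sum(dtype=np.uint64)
elapsed = time.perf_counter() - t0

print(f"sum={total}  {ITERS / elapsed / 1e6:.2f} M page-reads/s")
```

A compiled-language version with per-thread workers would be needed to actually saturate the memory controller.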

With an SSD-friendly 4K page size, that resulted in 2.8 million IOPS of QD1 random read. By comparison, a web search for QD1 results on Intel's Optane P5800X shows 0.11 million IOPS, and that's the fastest random-read SSD there is at those queue depths, AFAIK.

If you add parallelism, DDR5 reaches 11.6 million IOPS at QD16 (via 16 threads); fast SSDs reach around 1 million, and the Optane reaches 1.5 million. An EPYC Genoa server chip has 6 times as many DDR5 memory channels as this client system does; I'm not sure how well that scales, but 60 million 4kB random-read IOPS sounds plausible. Intel's memory controllers are supposedly even better (at least for clients). Turning on XMP and PBO improves results by 15-20%, and even tighter secondary/tertiary timings are likely possible.

I don't think you're going to reach those numbers, even with 24 fast NVMe drives.

And then there's the fact that I picked the SSD-friendly 4kB size; 64-byte random reads reach 260 million IOPS - that's not quite as much bandwidth as at 4kB, but the scaling is pretty decent. Good luck reaching those kinds of numbers on SSDs, let alone the kind of numbers a 12-channel server might reach...
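A quick back-of-the-envelope check on that bandwidth comparison, multiplying the IOPS figures above by their access sizes:

```python
# Bandwidth implied by the quoted random-read IOPS figures.
iops_4k = 11.6e6               # 4 KiB random reads at QD16
iops_64b = 260e6               # 64-byte random reads

bw_4k = iops_4k * 4096 / 1e9   # GB/s at 4 KiB
bw_64b = iops_64b * 64 / 1e9   # GB/s at 64 B

print(f"4 KiB: {bw_4k:.1f} GB/s, 64 B: {bw_64b:.1f} GB/s")
# -> 4 KiB: 47.5 GB/s, 64 B: 16.6 GB/s
```

So the 64-byte case delivers roughly a third of the 4 KiB bandwidth while doing over 20x the operations - decent scaling for accesses that small.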

We're getting close enough that the loss in performance on highly parallel workloads is perhaps acceptable for some applications. But it's still going to be a serious engineering challenge to even get there, and you're only going to come close under circumstances that are ideal for the NAND - with lower parallelism or smaller pages, it's pretty much hopeless to get within even the same order of magnitude.


I measured ~1.2M IOPS (random reads, 4kiB) from 3xNVMe in a software RAID configuration on a commodity server running Ubuntu Linux in 2021. Using Samsung SSDs, not Optane.

If that scaled, it would be 9.6M IOPS from 24xNVMe.
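That extrapolation is simple arithmetic - worth flagging that shared PCIe lanes, interrupt handling, and the software-RAID layer all tend to make real scaling sublinear, so it's an upper bound rather than a prediction:

```python
# Linear extrapolation of the measured 3-drive figure to 24 drives
# (upper bound: real scaling is sublinear for the reasons noted above).
measured_iops = 1.2e6          # measured: 3x NVMe in software RAID
drives = 24
projected = measured_iops * drives / 3
print(f"{projected / 1e6:.1f}M IOPS")  # -> 9.6M IOPS
```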


Which is quite respectable, but nevertheless a far cry from the presumed 60M+ IOPS the server would get from DRAM (70M, if it scaled linearly, which I doubt). Also, DRAM gets quite close to those numbers with only around 2 times as many threads as DRAM channels, but that NVMe setup will likely need a parallelism of at least 100 to get there - maybe much more.

Still, a mere factor of 7 isn't a _huge_ difference. There are plenty of use cases for that, especially since NAND has other advantages like cost/GB, capacity, and persistence.

But it's also not like this is going to replace DRAM very quickly. IOPS is one thing, but latency is another, and there DRAM is still much faster - close to 1000 times faster.


At this point cost would become the bottleneck. Compare 24x1TB NVMe drives to 24TB of DDR5.


That's an entirely different dimension - you can likely reach these throughput numbers on DDR5 even with a mere 16GB. And a massive 12-channel 6TB socket will likely have slightly less than 6 times the random-read bandwidth. Capacity and bandwidth aren't strongly related here.



