The problem is that a lot of workloads using random I/O have dependencies between the different I/O operations. A database traversing a tree-like index cannot issue the next read until the previous one has finished. You are limited by latency, which for NVMe is still orders of magnitude worse than memory.
> A database traversing a tree-like index cannot issue the next read until the previous one has finished.
This applies to a single point query in a single tree.
The latency is reduced by overlap in obvious ways as soon as you have (1) a range query because it can read multiple subtrees in parallel, or (2) a query that reads multiple indexes in parallel, or (3) multiple queries from the application to the database in parallel.
This is why it's useful to design applications to make multiple queries in parallel. Web applications are a great example of this. Most applications where I/O performance matters at all have some natural way to parallelise queries.
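To make the "multiple queries in parallel" point concrete, here is a minimal sketch (my own illustration, not from the comment) where `fetch` stands in for any blocking database round trip. Serially, total latency is roughly the sum of the individual latencies; with a pool it is roughly the max:

```python
# Hedged sketch: issuing independent point queries in parallel instead of
# serially. `fetch` is a placeholder for a real blocking query call.
from concurrent.futures import ThreadPoolExecutor

def fetch(key):
    # stand-in for a database round trip (network + disk latency)
    return key * 2

keys = [1, 2, 3, 4]

# Serial: total latency ~ sum of the per-query latencies.
serial = [fetch(k) for k in keys]

# Parallel: total latency ~ max of the per-query latencies.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel = list(pool.map(fetch, keys))

assert serial == parallel
```

The same shape works with async drivers; the point is only that independent queries should be in flight at the same time.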
Less obviously, the interior blocks of a B-tree are a relatively small part of it; most of the space is in the leaf blocks. If the database's cache strategy gives preference to interior nodes, and even more preference to nodes closer to the root, several interior layers of the tree can often fit entirely in RAM, which further reduces the latency of tree lookups once the cache is warm.
Then even in large databases (a few TB), a single point query costs only one or two reads, because the leaf page containing the query result is located entirely from in-memory data. The application-visible query time closely tracks the I/O subsystem's timing characteristics, and a few MQPS ("million queries per second") are achievable. Not many database engines achieve this, because they were designed in an era when I/O was much slower, but the I/O architecture does support it.
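A back-of-the-envelope calculation (my numbers, chosen only for illustration) shows why caching the interior layers is feasible at the multi-TB scale mentioned above: with 16 KiB pages and ~32-byte entries, three interior levels index roughly 2 TiB of leaves while themselves occupying only a few GiB.

```python
# Hedged sketch with assumed parameters: 16 KiB pages, 32 bytes per
# key + child pointer. Real engines vary, but the ratio is the point.
PAGE = 16 * 1024           # page size in bytes
ENTRY = 32                 # bytes per key + child pointer
fanout = PAGE // ENTRY     # ~512 children per interior node

leaf_pages = fanout ** 3                   # leaves under 3 interior levels
interior_pages = 1 + fanout + fanout ** 2  # root + two interior levels

leaf_bytes = leaf_pages * PAGE
interior_bytes = interior_pages * PAGE

print(leaf_bytes // 2**40, "TiB of leaf data")        # 2 TiB
print(interior_bytes // 2**30, "GiB of interior data") # ~4 GiB
```

So a few GiB of cache pins every lookup's path down to the last interior level, leaving one leaf read per point query.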
Source: Wrote a performance-optimised database engine for blockchain archive data, which is extremely random access (because of hashing), in the multiple terabytes range, and the application is bottlenecked on how many queries per second it can achieve. It's like the ideal case for working on random-access I/O performance :-)
Those are rarely the slow ones though. Lots of software simply has not been written to keep IO queues full. They read a bit of data, process it, read the next bit and so on. On a single thread. This makes all kinds of IO (including network) way slower than it has to be.
For example, a tree index can be parallelized by walking down different branches. On top of that, one can issue a prefetch for the next node (on each branch) while processing the current ones.
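A rough sketch of that idea (my own toy example, with a dict standing in for pages on disk and a thread pool standing in for an async I/O queue): children of every node on the current level are submitted as reads while the level itself is being processed, so the I/O for the next level overlaps with the current work.

```python
# Hedged sketch of overlapping node reads with processing during a tree
# walk. `read_node` stands in for a blocking page read; a real engine
# would use async I/O or io_uring instead of threads.
from concurrent.futures import ThreadPoolExecutor

TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"],
        "a1": [], "a2": [], "b1": []}

def read_node(node_id):
    return TREE[node_id]            # pretend this costs one I/O

def walk(root):
    visited = []
    with ThreadPoolExecutor(max_workers=8) as pool:
        ids = [root]
        frontier = [pool.submit(read_node, root)]
        while frontier:
            next_ids, next_frontier = [], []
            for node_id, fut in zip(ids, frontier):
                visited.append(node_id)     # "process" the node
                for child in fut.result():
                    # prefetch: submit the child read immediately,
                    # before we get around to processing it
                    next_ids.append(child)
                    next_frontier.append(pool.submit(read_node, child))
            ids, frontier = next_ids, next_frontier
    return visited
```

`walk("root")` visits the tree level by level with all reads of a level in flight at once, instead of one dependent read per step.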
Yup, a lot of software is (was?) written with assumptions that mattered for spinning rust. And even if the author didn't intend it, serial code generates serial, dependent IO.