Lemire's technique is really nice and, in general, a good thing to learn about, since it's a bit mind-bending how it plays with intervals. Sadly, the last time I benchmarked it on x86-64 for cryptographic purposes, it wasn't faster than rejection sampling, or than just drawing a large value and applying a modulo reduction. In all cases, what actually takes the time is the call to get good-quality randomness out of a CSPRNG; the rest is negligible in comparison.
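
For anyone who wants to compare the three approaches themselves, here is a rough Python sketch of what I mean by each. It is not the code I benchmarked: the function names, the 64-bit word size, the 128-bit width for the "large value" variant, and the use of `secrets.token_bytes` as the CSPRNG are all assumptions made for illustration.

```python
import secrets

MASK64 = (1 << 64) - 1


def rand64():
    """Return 64 random bits from the OS CSPRNG (assumed source)."""
    return int.from_bytes(secrets.token_bytes(8), "little")


def lemire(s, rng=rand64):
    """Lemire's multiply-and-shift method for 0 < s <= 2**64."""
    m = rng() * s              # 128-bit product
    low = m & MASK64           # low 64 bits decide whether we must reject
    if low < s:
        # 2**64 mod s: products whose low word is below this are rejected
        threshold = (1 << 64) % s
        while low < threshold:
            m = rng() * s
            low = m & MASK64
    return m >> 64             # high 64 bits are the result in [0, s)


def rejection(s, rng=rand64):
    """Classic rejection sampling: discard draws from the biased tail."""
    limit = (1 << 64) - ((1 << 64) % s)   # largest multiple of s <= 2**64
    while True:
        x = rng()
        if x < limit:
            return x % s


def large_modulo(s, rng=rand64):
    """Draw 128 bits and reduce mod s; the bias is at most about s / 2**128."""
    x = (rng() << 64) | rng()
    return x % s
```

In all three, every iteration starts with a call to `rng()`, and that call is the part I found to dominate. Timing each function with something like `timeit` on your own machine is the easiest way to check whether that holds for your CSPRNG.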