> An obvious optimization would be to utilize all available CPU cores by using the MapReduce pattern with multiple threads.
Nope, the GIL will make that useless. You need to actually implement the tight loops in C/C++ and call into that code with batches of data to get any benefit from threading.
An obvious but more expensive optimization would be to use a process pool. Make sure that all the objects you pass around are serializable, since multiprocessing pickles every argument and result that crosses the process boundary.
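Something like this, sketched with the stdlib multiprocessing module (the input file and parse function are made up):

```python
# Hypothetical sketch: a process pool sidesteps the GIL at the cost of
# pickling every argument and result crossing the process boundary.
from multiprocessing import Pool

def parse_line(line: str) -> int:
    # Workers must be top-level functions so the pool can pickle them.
    return len(line.split())

if __name__ == "__main__":
    with open("data.txt") as f:          # "data.txt" is a made-up input
        lines = f.readlines()
    with Pool() as pool:                 # defaults to one worker per core
        counts = pool.map(parse_line, lines, chunksize=1000)
    print(sum(counts))
```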
Python makes optimization much harder than it should be. I hope the GIL gets the hammer at some point, but that seems to be a huge task.
For this problem the multiple-process version would be quite simple in Python or any other language. It's a classic single program, multiple data (SPMD) task. You split the file into N chunks, then run N instances of the original program on them (a Map). You then need to collate the results, which requires a second program, but that step is similar to the sorting step in the original and so would be negligible wrt wall time (a quick Reduce).
For large files you should get almost embarrassing parallelism.
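A hedged sketch of that flow, with word counting standing in for whatever the original program does per chunk (file name and chunk count are illustrative):

```python
# Split the file into N byte ranges, run the per-chunk work on each
# range (Map), then merge the partial results (Reduce). Chunk
# boundaries are snapped to newlines so no line is split in two.
import os
from collections import Counter
from multiprocessing import Pool

FILE = "big.txt"                         # made-up input file
N = os.cpu_count() or 4

def count_chunk(bounds):
    start, end = bounds
    counts = Counter()
    with open(FILE, "rb") as f:
        if start:
            f.seek(start - 1)
            f.readline()                 # skip to the first line starting in this chunk
        while f.tell() < end:            # a line straddling `end` belongs to this chunk
            line = f.readline()
            if not line:
                break
            counts.update(line.split())
    return counts

if __name__ == "__main__":
    size = os.path.getsize(FILE)
    ranges = [(i * size // N, (i + 1) * size // N) for i in range(N)]
    with Pool(N) as pool:
        partials = pool.map(count_chunk, ranges)   # the Map step
    total = sum(partials, Counter())               # the quick Reduce
    print(total.most_common(10))
```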
>If I/O was the bottleneck, parallelizing it won't help, your SSD/network link/database won't get magically faster.
I think your SSD/network link/database might be able to work in parallel even when Python can't. Details:
Suppose I am scraping a website using a breadth-first approach. I have a long queue of pages to scrape. A single-threaded scraper looks like: pop the next page in the queue, block until the web server returns that page, repeat. A multi-threaded scraper looks like: a thread wakes up, pops the next page in the queue, sleeps until the web server returns that page, repeats. With the multi-threaded scraper I can initiate additional downloads while any given thread sleeps.
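For instance, a stdlib-only sketch of that threaded scraper (URLs and worker count invented):

```python
# Each urlopen() blocks in a system call, during which CPython releases
# the GIL, so other threads can initiate their own downloads meanwhile.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

queue = [f"https://example.com/page/{i}" for i in range(100)]  # hypothetical

def fetch(url):
    with urlopen(url, timeout=10) as resp:
        return resp.read()

with ThreadPoolExecutor(max_workers=16) as pool:
    for url, body in zip(queue, pool.map(fetch, queue)):
        print(url, len(body))
```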
My assumption here is that the download over the network is at some level performed by making a system call (how could it not be?). Once you have multiple system calls in flight, they can be as parallel as the OS permits; the OS doesn't have to worry about the GIL. The server should also be able to serve requests in parallel (assuming for the sake of argument that the server itself doesn't suffer from the GIL).
Same essential argument applies to the database. Suppose I'm communicating with the database using IPC. The database isn't written in Python and doesn't suffer from the GIL. Multiple Python threads can be sleeping on the database while it processes their requests, possibly in parallel if the DB supports that.
I think this argument could even work for the SSD if the kernel is able to batch your requests in a way that takes advantage of the hardware, according to this person: https://news.ycombinator.com/item?id=33752411
Very curious to hear your thoughts here. Essentially my argument is that the SSD/network link/database could be a "bottleneck" in terms of latency without being the bottleneck in terms of throughput (i.e. it has unused parallel capacity even though it's operating at maximum speed).
You're right, my comment only applies when bandwidth is the bottleneck. In Python, that web scraper could probably do even better with asyncio in a single OS thread.
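Something like this, assuming the third-party aiohttp package (URLs again invented):

```python
# Same scraper on asyncio in a single OS thread. The semaphore caps
# in-flight requests the way max_workers capped threads above.
import asyncio
import aiohttp

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(100)]  # hypothetical
    sem = asyncio.Semaphore(16)

    async def fetch(session, url):
        async with sem, session.get(url) as resp:
            return url, await resp.read()

    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
    for url, body in results:
        print(url, len(body))

asyncio.run(main())
```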
> If I/O was the bottleneck, parallelizing it won't help, your SSD/network link/database won't get magically faster.
Of course it will. Nearly every serious DB will work on multiple requests in parallel, and unless the DB itself sits on something slow, you will get data faster from two parallel requests than from serializing them.
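To illustrate, with sqlite3 standing in for the database (file name and queries invented):

```python
# Each thread opens its own connection, and the C-level driver releases
# the GIL while a query executes, so the two reads can genuinely overlap.
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def query(sql):
    con = sqlite3.connect("app.db")      # one connection per thread
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()

with ThreadPoolExecutor(max_workers=2) as pool:
    users = pool.submit(query, "SELECT * FROM users")
    orders = pool.submit(query, "SELECT * FROM orders")
    print(len(users.result()), len(orders.result()))
```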
NVMe SSDs in particular can easily deliver more than a single thread can read; just run fio with one thread vs. several parallel threads to see that.
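A rough Python analogue of that experiment, if you don't have fio handy (Unix-only, the file name is made up, and the page cache will flatter repeated runs; fio with O_DIRECT is the real tool):

```python
# Time reading a large file with one thread vs. several, each thread
# os.pread()-ing its own byte range. pread releases the GIL, so the
# drive sees genuinely concurrent requests.
import os, time
from concurrent.futures import ThreadPoolExecutor

PATH, BLOCK = "big.bin", 1 << 20         # hypothetical file, 1 MiB reads

def read_range(fd, start, end):
    off = start
    while off < end:
        off += len(os.pread(fd, min(BLOCK, end - off), off))

def timed(n_threads):
    fd = os.open(PATH, os.O_RDONLY)
    size = os.fstat(fd).st_size
    t0 = time.perf_counter()
    with ThreadPoolExecutor(n_threads) as pool:
        for i in range(n_threads):
            pool.submit(read_range, fd,
                        i * size // n_threads, (i + 1) * size // n_threads)
    os.close(fd)
    return time.perf_counter() - t0

print(f"1 thread: {timed(1):.2f}s, 8 threads: {timed(8):.2f}s")
```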
> If you're writing to files, most of the time the parallelism will be hard to implement correctly. SQLite doesn't support parallel writes, for example.
That's just one random example. If all you do is "read data, parse, write data" in some batch job, you can get massive parallelism. Sharding is also an easy way to saturate the I/O.
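A sketch of such a sharded job (shard layout and parse step are placeholders):

```python
# Each worker owns one input shard and writes its own output file, so
# no locking is needed anywhere.
import glob
from multiprocessing import Pool

def process_shard(path):
    out_path = path + ".out"
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.upper())      # stand-in for real parsing
    return out_path

if __name__ == "__main__":
    shards = glob.glob("input/shard-*.txt")   # hypothetical shard files
    with Pool() as pool:
        print(pool.map(process_shard, shards))
```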