> An obvious optimization would be to utilize all available CPU cores by using the MapReduce pattern with multiple threads.
Nope, the GIL will make that useless. You need to actually implement the tight loops in C/C++ and call into that code with batches of data to get any benefit from threading.
An obvious but more expensive optimization would be to use a process pool. Make sure that all the objects you pass around are serializable, since multiprocessing pickles every argument and result that crosses the process boundary.
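Something like this, sketched with the stdlib multiprocessing module (the input file and parse function are made up):

```python
# Hypothetical sketch: a process pool sidesteps the GIL at the cost of
# pickling every argument and result crossing the process boundary.
from multiprocessing import Pool

def parse_line(line: str) -> int:
    # Workers must be top-level functions so the pool can pickle them.
    return len(line.split())

if __name__ == "__main__":
    with open("data.txt") as f:          # "data.txt" is a made-up input
        lines = f.readlines()
    with Pool() as pool:                 # defaults to one worker per core
        counts = pool.map(parse_line, lines, chunksize=1000)
    print(sum(counts))
```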
Python makes optimization much harder than it should be. I hope the GIL gets the hammer at some point, but that seems to be a huge task.
For this problem the multiple-process version would be quite simple in Python or any other language. It's a classic single program, multiple data (SPMD) task. You split the file into N chunks, then run N instances of the original program on them (a Map). You then need to collate the results, which requires a second program, but that step is similar to the sorting step in the original and so would be negligible wrt wall time (a quick Reduce).
For large files you should get almost embarrassing parallelism.
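A hedged sketch of that flow, with word counting standing in for whatever the original program does per chunk (file name and chunk count are illustrative):

```python
# Split the file into N byte ranges, run the per-chunk work on each
# range (Map), then merge the partial results (Reduce). Chunk
# boundaries are snapped to newlines so no line is split in two.
import os
from collections import Counter
from multiprocessing import Pool

FILE = "big.txt"                         # made-up input file
N = os.cpu_count() or 4

def count_chunk(bounds):
    start, end = bounds
    counts = Counter()
    with open(FILE, "rb") as f:
        if start:
            f.seek(start - 1)
            f.readline()                 # skip to the first line starting in this chunk
        while f.tell() < end:            # a line straddling `end` belongs to this chunk
            line = f.readline()
            if not line:
                break
            counts.update(line.split())
    return counts

if __name__ == "__main__":
    size = os.path.getsize(FILE)
    ranges = [(i * size // N, (i + 1) * size // N) for i in range(N)]
    with Pool(N) as pool:
        partials = pool.map(count_chunk, ranges)   # the Map step
    total = sum(partials, Counter())               # the quick Reduce
    print(total.most_common(10))
```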
>If I/O was the bottleneck, parallelizing it won't help, your SSD/network link/database won't get magically faster.
I think your SSD/network link/database might be able to work in parallel even when Python can't. Details:
Suppose I am scraping a website using a breadth-first approach. I have a long queue of pages to scrape. A single-threaded scraper looks like: pop the next page in the queue, block until the web server returns that page, repeat. A multi-threaded scraper looks like: a thread wakes up, pops the next page in the queue, sleeps until the web server returns that page, repeats. With the multi-threaded scraper I can initiate additional downloads while any given thread sleeps.
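For instance, a stdlib-only sketch of that threaded scraper (URLs and worker count invented):

```python
# Each urlopen() blocks in a system call, during which CPython releases
# the GIL, so other threads can initiate their own downloads meanwhile.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

queue = [f"https://example.com/page/{i}" for i in range(100)]  # hypothetical

def fetch(url):
    with urlopen(url, timeout=10) as resp:
        return resp.read()

with ThreadPoolExecutor(max_workers=16) as pool:
    for url, body in zip(queue, pool.map(fetch, queue)):
        print(url, len(body))
```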
My assumption here is that the download over the network is at some level performed by making a system call (how could it not be?). Once you have multiple system calls in flight, they can be as parallel as the OS permits; the OS doesn't have to worry about the GIL. The server should also be able to serve requests in parallel (assuming for the sake of argument that the server itself doesn't suffer from the GIL).
Same essential argument applies to the database. Suppose I'm communicating with the database using IPC. The database isn't written in Python and doesn't suffer from the GIL. Multiple Python threads can be sleeping on the database while it processes their requests, possibly in parallel if the DB supports that.
I think this argument could even work for the SSD if the kernel is able to batch your requests in a way that takes advantage of the hardware, according to this person: https://news.ycombinator.com/item?id=33752411
Very curious to hear your thoughts here. Essentially my argument is that the SSD/network link/database could be a "bottleneck" in terms of latency without being the bottleneck in terms of throughput (i.e. it has unused parallel capacity even though it's operating at maximum speed).
You're right, my comment only applies when bandwidth is the bottleneck. In Python, that web scraper could probably do even better with asyncio in a single OS thread.
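Something like this, assuming the third-party aiohttp package (URLs again invented):

```python
# Same scraper on asyncio in a single OS thread. The semaphore caps
# in-flight requests the way max_workers capped threads above.
import asyncio
import aiohttp

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(100)]  # hypothetical
    sem = asyncio.Semaphore(16)

    async def fetch(session, url):
        async with sem, session.get(url) as resp:
            return url, await resp.read()

    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
    for url, body in results:
        print(url, len(body))

asyncio.run(main())
```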
> If I/O was the bottleneck, parallelizing it won't help, your SSD/network link/database won't get magically faster.
Of course it will. Nearly every serious DB will work on multiple requests in parallel, and unless the DB itself sits on something slow, you will get data faster from two parallel requests than from serializing them.
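To illustrate, with sqlite3 standing in for the database (file name and queries invented):

```python
# Each thread opens its own connection, and the C-level driver releases
# the GIL while a query executes, so the two reads can genuinely overlap.
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def query(sql):
    con = sqlite3.connect("app.db")      # one connection per thread
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()

with ThreadPoolExecutor(max_workers=2) as pool:
    users = pool.submit(query, "SELECT * FROM users")
    orders = pool.submit(query, "SELECT * FROM orders")
    print(len(users.result()), len(orders.result()))
```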
NVMe SSDs in particular can easily deliver more than a single thread can read; just run fio with one thread vs. several parallel threads to see that.
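A rough Python analogue of that experiment, if you don't have fio handy (Unix-only, the file name is made up, and the page cache will flatter repeated runs; fio with O_DIRECT is the real tool):

```python
# Time reading a large file with one thread vs. several, each thread
# os.pread()-ing its own byte range. pread releases the GIL, so the
# drive sees genuinely concurrent requests.
import os, time
from concurrent.futures import ThreadPoolExecutor

PATH, BLOCK = "big.bin", 1 << 20         # hypothetical file, 1 MiB reads

def read_range(fd, start, end):
    off = start
    while off < end:
        off += len(os.pread(fd, min(BLOCK, end - off), off))

def timed(n_threads):
    fd = os.open(PATH, os.O_RDONLY)
    size = os.fstat(fd).st_size
    t0 = time.perf_counter()
    with ThreadPoolExecutor(n_threads) as pool:
        for i in range(n_threads):
            pool.submit(read_range, fd,
                        i * size // n_threads, (i + 1) * size // n_threads)
    os.close(fd)
    return time.perf_counter() - t0

print(f"1 thread: {timed(1):.2f}s, 8 threads: {timed(8):.2f}s")
```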
> If you're writing to files, most of the time the parallelism will be hard to implement correctly. SQLite doesn't support parallel writes, for example.
That's just one random example. If all you do is "read data, parse, write data" in some batch job, you can get massive parallelism. Sharding is also an easy way to saturate the I/O.
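A sketch of such a sharded job (shard layout and parse step are placeholders):

```python
# Each worker owns one input shard and writes its own output file, so
# no locking is needed anywhere.
import glob
from multiprocessing import Pool

def process_shard(path):
    out_path = path + ".out"
    with open(path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.upper())      # stand-in for real parsing
    return out_path

if __name__ == "__main__":
    shards = glob.glob("input/shard-*.txt")   # hypothetical shard files
    with Pool() as pool:
        print(pool.map(process_shard, shards))
```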