
I wrote a fairly complex spidering and scraping script in Node a few months ago. I found downcache[1] to be absolutely invaluable, particularly while I was debugging my parsing scripts, as I was able to rerun them relatively quickly over the cached responses.
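
Roughly, usage looked like this (from memory, so the options and callback signature may not match the current README exactly, and parseListing stands in for my actual parsing code):

    var downcache = require('downcache');

    // Fetch a page, transparently caching the response to disk so that
    // re-runs read from the local cache instead of hitting the network.
    downcache('http://example.com/page.html', { dir: '.cache' }, function (err, res, body) {
      if (err) return console.error(err);
      parseListing(body); // stand-in for the real parsing code
    });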

However, when the network was no longer a bottleneck, I found that the speed and single-threaded nature of Node became one. It wasn't really that slow, relatively speaking, but I had a few hundred gigs of HTML to chew through every time I made a correction, so it was important to keep the turnaround as fast as possible.

I eventually managed to manually partition the task so I could launch separate Node scripts to handle different parts of it, but it wasn't a perfect split, and there was a fair bit of duplicated work, where a shared cache would have helped a great deal.
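
The split was about as crude as it sounds -- something along these lines (the worker count and "parse-worker.js" are illustrative; mine were several ad-hoc scripts):

    var fork = require('child_process').fork;
    var fs = require('fs');

    // Deal the cached files out to N child processes round-robin.
    // No shared state between them, so work that spanned partitions
    // got done more than once.
    var files = fs.readdirSync('.cache');
    var WORKERS = 4;

    for (var i = 0; i < WORKERS; i++) {
      var child = fork('./parse-worker.js');
      child.send(files.filter(function (f, idx) { return idx % WORKERS === i; }));
    }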

In retrospect, I should have thrown my JS away and started again in something with easy threading like Java or C#. But -- familiar story -- I'd underestimated the complexity of the task to begin with, and by the time I understood it, I'd sunk a lot of time into writing my JS parsing code and didn't fancy converting it all to another language, particularly when it always seemed like "just one more" correction to the parsing would make everything work right. In the end, what was supposed to take a weekend took about three months of work, off and on, to finish.

[1] https://www.npmjs.com/package/downcache



Threading in Node is very easy: just use the cluster module. Alternatively, take any of the CPU-intensive activity, like parsing the HTML and formatting it as JSON, and just put that on an AWS Lambda.
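
Something like this gets you one worker per core (doParsing being whatever your CPU-heavy step is):

    var cluster = require('cluster');
    var os = require('os');

    if (cluster.isMaster) {
      // Fork one worker per CPU core.
      for (var i = 0; i < os.cpus().length; i++) {
        cluster.fork();
      }
    } else {
      // Every forked worker runs this branch.
      doParsing();
    }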

You can invoke as many lambdas from your application as you want in parallel and you're not going to be bottlenecked by your CPU :)
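
The fan-out is a few lines with the Node SDK (sketch assumes aws-sdk v2, a deployed function called "parseHtml", and a `pages` array already in scope):

    var AWS = require('aws-sdk');
    var lambda = new AWS.Lambda({ region: 'us-east-1' });

    // Invoke one Lambda per page, all in flight at once.
    var invocations = pages.map(function (page) {
      return lambda.invoke({
        FunctionName: 'parseHtml', // whatever you deployed
        Payload: JSON.stringify({ html: page })
      }).promise();
    });

    Promise.all(invocations).then(function (results) {
      // each results[i].Payload is the JSON your function returned
    });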


Clustering in Node creates isolated child processes, not threads. I needed to have shared queues, in-memory caches, and hashes to coordinate workers and avoid them doing duplicate work.

I did consider using clustering and having some master process coordinate everything, and using some shared-memory caching library. But it would not be "easy" to set up, especially compared to something like Java, where you get thread pools and synchronized, thread-safe collections out of the box.
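
To give a flavour of the plumbing involved: even a minimal master-coordinated queue means hand-rolling all the messaging yourself. A sketch only (buildWorkQueue stands in for loading the real work list, and error handling is omitted):

    var cluster = require('cluster');
    var os = require('os');

    if (cluster.isMaster) {
      var queue = buildWorkQueue(); // stand-in: the files still to parse
      var seen = {};                // the "shared" state lives only in the master

      for (var i = 0; i < os.cpus().length; i++) {
        cluster.fork();
      }

      Object.keys(cluster.workers).forEach(function (id) {
        var worker = cluster.workers[id];
        worker.on('message', function (msg) {
          seen[msg.key] = true;
          var next;
          while ((next = queue.shift()) && seen[next]) {} // skip completed items
          if (next) worker.send(next);
        });
        var first = queue.shift();
        if (first) worker.send(first);
      });
    } else {
      process.on('message', function (item) {
        // parse the item, then report back so the master can hand out more
        process.send({ key: item });
      });
    }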

And Lambda would have been totally impractical. As I said, I had hundreds of gigs of data to process. If I'd been uploading all of that over my puny ADSL upstream every time, I'd still be waiting for a single run to complete.

I'm not trashing Node. I like it. There's a reason I used it in the first place, after all. But for this particular use case, I didn't find it was a very good fit.


Threading for a crawler is just a dirty way of not handling distribution. When you need more than one server, your threads won't save you. It has nothing to do with Node.js and its thread support.


I wasn't creating a new search engine, I was doing a one-off scraping job in my spare time. Creating a fully distributed solution would have been total overkill. But threading could and would have helped.

Honestly, stupidly hostile and ignorant comments like this are the absolute worst thing about Hacker News.


Wonder how difficult it would have been to pull the JS portion into Java by way of Rhino.



