
I wrote a fairly complex spidering and scraping script in Node a few months ago. I found downcache[1] to be absolutely invaluable, particularly while I was debugging my parsing scripts, as I was able to rerun them relatively quickly over the cached responses.
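
Roughly, usage looked like this (from memory, so the options and callback signature may not match the current README exactly, and parseListing stands in for my actual parsing code):

    var downcache = require('downcache');

    // Fetch a page, transparently caching the response to disk so that
    // re-runs read from the local cache instead of hitting the network.
    downcache('http://example.com/page.html', { dir: '.cache' }, function (err, res, body) {
      if (err) return console.error(err);
      parseListing(body); // stand-in for the real parsing code
    });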

However, when the network was no longer a bottleneck, I found that the speed and single-threaded nature of Node became one. It wasn't really that slow, relatively speaking, but I had a few hundred gigs of HTML to chew through every time I made a correction, so it was important to keep the turnaround as fast as possible.

I eventually managed to manually partition the task so I could launch separate Node scripts to handle different parts of it, but it wasn't a perfect split, and there was a fair bit of duplicated work, where a shared cache would have helped a great deal.
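
The split was about as crude as it sounds -- something along these lines (the worker count and "parse-worker.js" are illustrative; mine were several ad-hoc scripts):

    var fork = require('child_process').fork;
    var fs = require('fs');

    // Deal the cached files out to N child processes round-robin.
    // No shared state between them, so work that spanned partitions
    // got done more than once.
    var files = fs.readdirSync('.cache');
    var WORKERS = 4;

    for (var i = 0; i < WORKERS; i++) {
      var child = fork('./parse-worker.js');
      child.send(files.filter(function (f, idx) { return idx % WORKERS === i; }));
    }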

In retrospect, I should have thrown my JS away and started again in something with easy threading like Java or C#. But -- familiar story -- I'd underestimated the complexity of the task to begin with, and by the time I understood it, I'd sunk a lot of time into writing my JS parsing code and didn't fancy converting it all to another language, particularly when it always seemed like "just one more" correction to the parsing would make everything work right. In the end, what was supposed to take a weekend took about three months of work, off and on, to finish.

[1] https://www.npmjs.com/package/downcache



Threading in Node is very easy: just use the cluster module. Alternatively, take any of the CPU-intensive activity, like parsing the HTML and formatting it as JSON, and just put that on an AWS Lambda.
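
Something like this gets you one worker per core (doParsing being whatever your CPU-heavy step is):

    var cluster = require('cluster');
    var os = require('os');

    if (cluster.isMaster) {
      // Fork one worker per CPU core.
      for (var i = 0; i < os.cpus().length; i++) {
        cluster.fork();
      }
    } else {
      // Every forked worker runs this branch.
      doParsing();
    }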

You can invoke as many lambdas from your application as you want in parallel and you're not going to be bottlenecked by your CPU :)
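
The fan-out is a few lines with the Node SDK (sketch assumes aws-sdk v2, a deployed function called "parseHtml", and a `pages` array already in scope):

    var AWS = require('aws-sdk');
    var lambda = new AWS.Lambda({ region: 'us-east-1' });

    // Invoke one Lambda per page, all in flight at once.
    var invocations = pages.map(function (page) {
      return lambda.invoke({
        FunctionName: 'parseHtml', // whatever you deployed
        Payload: JSON.stringify({ html: page })
      }).promise();
    });

    Promise.all(invocations).then(function (results) {
      // each results[i].Payload is the JSON your function returned
    });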


Clustering in Node creates isolated child processes, not threads. I needed to have shared queues, in-memory caches, and hashes to coordinate workers and avoid them doing duplicate work.

I did consider using clustering and having some master process coordinate everything, and using some shared-memory caching library. But it would not be "easy" to set up, especially compared to something like Java, where you get thread pools and synchronized, thread-safe collections out of the box.
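
To give a flavour of the plumbing involved: even a minimal master-coordinated queue means hand-rolling all the messaging yourself. A sketch only (buildWorkQueue stands in for loading the real work list, and error handling is omitted):

    var cluster = require('cluster');
    var os = require('os');

    if (cluster.isMaster) {
      var queue = buildWorkQueue(); // stand-in: the files still to parse
      var seen = {};                // the "shared" state lives only in the master

      for (var i = 0; i < os.cpus().length; i++) {
        cluster.fork();
      }

      Object.keys(cluster.workers).forEach(function (id) {
        var worker = cluster.workers[id];
        worker.on('message', function (msg) {
          seen[msg.key] = true;
          var next;
          while ((next = queue.shift()) && seen[next]) {} // skip completed items
          if (next) worker.send(next);
        });
        var first = queue.shift();
        if (first) worker.send(first);
      });
    } else {
      process.on('message', function (item) {
        // parse the item, then report back so the master can hand out more
        process.send({ key: item });
      });
    }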

And Lambda would have been totally impractical. As I said, I had hundreds of gigs of data to process. If I'd been uploading all of that over my puny ADSL upstream every time, I'd still be waiting for a single run to complete.

I'm not trashing Node. I like it. There's a reason I used it in the first place, after all. But for this particular use case, I didn't find it was a very good fit.


Threading for a crawler is just a dirty way of not handling distribution. When you need more than one server, your threads won't save you. It has nothing to do with Node.js and its thread support.


I wasn't creating a new search engine, I was doing a one-off scraping job in my spare time. Creating a fully distributed solution would have been total overkill. But threading could and would have helped.

Honestly, stupidly hostile and ignorant comments like this are the absolute worst thing about Hacker News.


Wonder how difficult it would have been to pull the JS portion into Java by way of Rhino.



