I do a significant amount of scraping for hobby projects, albeit mostly open websites. As a result, I've gotten pretty good at circumventing rate-limiting and most other controls.
I suspect I'm one of those bad people your parents tell you to avoid - by that I mean I completely ignore robots.txt.
At this point, my architecture has settled on a distributed RPC system with a rotating swarm of clients. I use RabbitMQ for message-passing middleware, SaltStack for automated VM provisioning, and Python everywhere for everything else. Using some randomization and a list of the top n user agents, I can generate ~800K unique but valid-looking UAs. Selenium+PhantomJS gets you through non-CAPTCHA Cloudflare. Backing storage is Postgres.
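The UA-generation trick can be sketched roughly like this. The component pools below are illustrative stand-ins, not the real top-n list; the point is that combinatorially mixing OS strings, engine builds, and browser versions yields far more unique-but-plausible strings than any single pool.

```python
import random

# Hypothetical component pools -- a real implementation would derive these
# from a survey of the top-N observed user agents.
OS_STRINGS = [
    "Windows NT 10.0; Win64; x64",
    "Macintosh; Intel Mac OS X 10_15_7",
    "X11; Linux x86_64",
]
CHROME_MAJORS = list(range(45, 55))
CHROME_BUILDS = list(range(0, 3000))
WEBKIT_MINORS = list(range(0, 40))

def random_ua(rng: random.Random) -> str:
    """Assemble a plausible Chrome-style UA from randomized components."""
    os_s = rng.choice(OS_STRINGS)
    major = rng.choice(CHROME_MAJORS)
    build = rng.choice(CHROME_BUILDS)
    wk = rng.choice(WEBKIT_MINORS)
    return (f"Mozilla/5.0 ({os_s}) AppleWebKit/537.{wk} "
            f"(KHTML, like Gecko) Chrome/{major}.0.{build}.0 Safari/537.{wk}")
```

Even these toy pools give 3 × 10 × 3000 × 40 = 3.6M combinations, so getting to ~800K distinct strings from real pools is straightforward.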
Database triggers do row versioning, and I wind up with what is basically a mini internet-archive of my own, with periodic snapshots of a site over time. Additionally, I have a readability-like processing layer that re-writes the page content in hopes of making the resulting layout actually pleasant to read, with pluggable rulesets that determine page element decomposition.
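The trigger-based versioning idea can be demonstrated in miniature. This sketch uses SQLite (the actual system is on Postgres, and the table names here are made up), but the mechanism is the same: every UPDATE copies the old row into a history table, so superseded snapshots accumulate automatically.

```python
import sqlite3

# Self-contained demo of trigger-based row versioning. Schema is
# hypothetical; the real Postgres tables are not shown in the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT, fetched_at TEXT);
CREATE TABLE pages_history (url TEXT, content TEXT, fetched_at TEXT);

-- Before any UPDATE, archive the about-to-be-replaced row.
CREATE TRIGGER pages_version BEFORE UPDATE ON pages
BEGIN
    INSERT INTO pages_history VALUES (OLD.url, OLD.content, OLD.fetched_at);
END;
""")
conn.execute("INSERT INTO pages VALUES ('http://example.com', 'v1', '2015-01-01')")
conn.execute("UPDATE pages SET content='v2', fetched_at='2015-02-01' "
             "WHERE url='http://example.com'")
history = conn.execute("SELECT content FROM pages_history").fetchall()
# history now holds the superseded snapshot
```

The nice property is that the application layer never has to remember to version anything; the database does it unconditionally.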
At this point, I have a system that is, as far as I can tell, definitionally a botnet. The only thing is, I actually pay for the hosts.
---
Scaling something like this up to high volume is a really interesting challenge. My hosts are physically distributed, and just maintaining the RabbitMQ socket links is hard. I've actually had to do some hacking on the RabbitMQ library to let it handle the various ways I've seen a socket get wedged, and I still have some reliability issues in the SaltStack-DigitalOcean interface where VM creation gets stuck in an infinite loop, leading to me bleeding all my hosts. I also had to implement my own message fragmentation on top of RabbitMQ, because literally no AMQP library I found could reliably handle large (>100 KB) messages without eventually wedging.
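Application-level fragmentation on top of a message queue can be sketched like this. The frame format (JSON with a message id, index, and total count) is my own invention for illustration, not the thread author's actual wire format; the idea is just to split a large payload into chunks small enough that the AMQP library never sees an oversized message, and reassemble on the consumer side.

```python
import base64
import json
import math
import uuid

CHUNK_SIZE = 64 * 1024  # stay well below the sizes where libraries wedge

def fragment(payload: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a payload into self-describing JSON frames for the queue."""
    msg_id = uuid.uuid4().hex
    total = max(1, math.ceil(len(payload) / chunk_size))
    for idx in range(total):
        chunk = payload[idx * chunk_size:(idx + 1) * chunk_size]
        yield json.dumps({
            "id": msg_id,
            "index": idx,
            "total": total,
            "data": base64.b64encode(chunk).decode("ascii"),
        })

def reassemble(frames):
    """Collect frames (in any order) and return the payload once complete."""
    buffers = {}
    for frame in frames:
        meta = json.loads(frame)
        parts = buffers.setdefault(meta["id"], {})
        parts[meta["index"]] = base64.b64decode(meta["data"])
        if len(parts) == meta["total"]:
            return b"".join(parts[i] for i in range(meta["total"]))
    return None  # incomplete
```

Because frames carry their own index and total, they can arrive out of order or interleaved with frames of other messages, which is exactly what happens with multiple producers on one queue.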
There are other fun problems too, like the fact that I have a Postgres database that's ~700 GB in size, which means you have to spend time considering your DB design and doing query optimization too. I apparently have big data problems in my bedroom (my home servers are in my bedroom closet).
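At that size, the single most common optimization check is whether a hot query actually hits an index instead of scanning the table. A tiny illustration of that workflow (SQLite stands in for Postgres here, and the schema is hypothetical; on Postgres you'd use `EXPLAIN` the same way):

```python
import sqlite3

# Hypothetical snapshot table; check the query plan before and after
# adding an index on the lookup column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshots (url TEXT, fetched_at TEXT, body TEXT)")

plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM snapshots WHERE url = ?", ("x",)
).fetchall()

conn.execute("CREATE INDEX idx_snapshots_url ON snapshots(url)")

plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM snapshots WHERE url = ?", ("x",)
).fetchall()
# plan_before reports a full table SCAN; plan_after reports a SEARCH
# using idx_snapshots_url.
```

It's a toy example, but the habit scales: on a 700 GB database, the difference between those two plans is the difference between milliseconds and minutes.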
At one point, a friend and I were looking at trying to basically replicate the google deep-dream neural net thing, only with a training set of porn. It turns out getting a well tagged dataset for training is somewhat challenging.
Well-tagged hentai is trivially accessible, though. I think there's probably a paper or two in there about the demographics of the two fan groups. People are fascinating.
I'm not scraping high-value sites like that (I mostly target amateur original content). It's not really of interest to other businesses. As such, I tend to just run into things like normal Cloudflare-wrapped sites, and one place that tried to detect bots and return intentionally garbled data.
If I run into that sort of thing, I guess we'll see.
I've never used this, and it's incredibly shady considering the users probably don't realize their Hola browser plugin does this, but Hola runs a paid VPN service where you can get thousands of low-bandwidth connections on unique residential IP addresses, provided generously by their "free" VPN users... It's essentially a legitimized attempt at running a botnet as a service.
I'm sure there'd be a ton of people who would love to pay to use your platform (who cares if the source is available; I don't want to run my own, because once the code is written, it's the ops that's hard). But then I suppose it would be hard to stay unnoticed.
Yeah, running this thing publicly would be a huge mess from a copyright perspective, since it literally re-hosts everything as a core part of how it works.
As it is, I think I'm OK, since it's basically just a "website DVR" type thing, for my own use.
Really, if nothing else, the project has been enormously educational for me. I've learned a boatload about distributed systems, picked up a bit of SQL, dicked about with databases a bunch, and actually experienced deploying a complex multi-component application across multiple disparate data centers.
This project is really cool. Last year I was looking into open source projects that implement something like Readability so that I could scrape articles from my RSS feeds and turn them into plaintext. But I didn't find anything that blew me away. The best I got was stealing the implementation from Firefox, and I lost interest before I could make it worthwhile. (Now revisiting the idea, I wonder why I never thought of passing a user-agent from a mobile browser... Probably would have helped a lot.)
I see you don't have a license listed on GitHub. Do you have a license in mind for these?
It's probably GPL; I'll have to figure out my dependencies and see what it's infected with. I tend to err toward BSD for my own cruft.
This isn't quite as fancy as readability, though I integrated a port of readability for a while. Now I just write a ruleset for a site that has stuff that interests me.
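A per-site ruleset can be as simple as a dict of patterns keyed by site. This sketch is purely illustrative (the real rulesets and their structure aren't shown in the thread, and a production version would use a real HTML parser rather than regexes):

```python
import re

# Hypothetical pluggable rulesets: each site maps to a pattern that
# selects the content element plus patterns for cruft to strip first.
RULESETS = {
    "example.com": {
        "content": re.compile(r"<article>(.*?)</article>", re.S),
        "strip": [re.compile(r"<aside>.*?</aside>", re.S)],
    },
}

def extract(site, html):
    """Apply a site's ruleset to raw HTML; None if no ruleset matches."""
    rules = RULESETS.get(site)
    if rules is None:
        return None
    for pattern in rules["strip"]:
        html = pattern.sub("", html)
    match = rules["content"].search(html)
    return match.group(1).strip() if match else None
```

Adding support for a new site is then just registering another entry in `RULESETS`, which is the appeal of the pluggable approach over a one-size-fits-all readability heuristic.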
---
It's all on github, FWIW:
Manager: https://github.com/fake-name/ReadableWebProxy
Agent and salt scheduler: https://github.com/fake-name/AutoTriever