I do a significant amount of scraping for hobby projects, albeit mostly open websites. As a result, I've gotten pretty good at circumventing rate-limiting and most other controls.
I suspect I'm one of those bad people your parents tell you to avoid - by that I mean I completely ignore robots.txt.
At this point, my architecture has settled on a distributed RPC system with a rotating swarm of clients. I use RabbitMQ for message-passing middleware, SaltStack for automated VM provisioning, and Python everywhere for everything else. Using some randomization and a list of the top n user agents, I can generate ~800K unique but valid-looking UAs. Selenium+PhantomJS gets you through non-CAPTCHA Cloudflare. Backing storage is Postgres.
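The UA-generation trick can be sketched roughly like this. The component pools below are illustrative stand-ins, not the real top-n list; the point is that combinatorially mixing OS strings, engine builds, and browser versions yields far more unique-but-plausible strings than any single pool.

```python
import random

# Hypothetical component pools -- a real implementation would derive these
# from a survey of the top-N observed user agents.
OS_STRINGS = [
    "Windows NT 10.0; Win64; x64",
    "Macintosh; Intel Mac OS X 10_15_7",
    "X11; Linux x86_64",
]
CHROME_MAJORS = list(range(45, 55))
CHROME_BUILDS = list(range(0, 3000))
WEBKIT_MINORS = list(range(0, 40))

def random_ua(rng: random.Random) -> str:
    """Assemble a plausible Chrome-style UA from randomized components."""
    os_s = rng.choice(OS_STRINGS)
    major = rng.choice(CHROME_MAJORS)
    build = rng.choice(CHROME_BUILDS)
    wk = rng.choice(WEBKIT_MINORS)
    return (f"Mozilla/5.0 ({os_s}) AppleWebKit/537.{wk} "
            f"(KHTML, like Gecko) Chrome/{major}.0.{build}.0 Safari/537.{wk}")
```

Even these toy pools give 3 × 10 × 3000 × 40 = 3.6M combinations, so getting to ~800K distinct strings from real pools is straightforward.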
Database triggers do row versioning, and I wind up with what is basically a mini internet-archive of my own, with periodic snapshots of a site over time. Additionally, I have a readability-like processing layer that re-writes the page content in hopes of making the resulting layout actually pleasant to read, with pluggable rulesets that determine page element decomposition.
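The trigger-based versioning idea can be demonstrated in miniature. This sketch uses SQLite (the actual system is on Postgres, and the table names here are made up), but the mechanism is the same: every UPDATE copies the old row into a history table, so superseded snapshots accumulate automatically.

```python
import sqlite3

# Self-contained demo of trigger-based row versioning. Schema is
# hypothetical; the real Postgres tables are not shown in the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT, fetched_at TEXT);
CREATE TABLE pages_history (url TEXT, content TEXT, fetched_at TEXT);

-- Before any UPDATE, archive the about-to-be-replaced row.
CREATE TRIGGER pages_version BEFORE UPDATE ON pages
BEGIN
    INSERT INTO pages_history VALUES (OLD.url, OLD.content, OLD.fetched_at);
END;
""")
conn.execute("INSERT INTO pages VALUES ('http://example.com', 'v1', '2015-01-01')")
conn.execute("UPDATE pages SET content='v2', fetched_at='2015-02-01' "
             "WHERE url='http://example.com'")
history = conn.execute("SELECT content FROM pages_history").fetchall()
# history now holds the superseded snapshot
```

The nice property is that the application layer never has to remember to version anything; the database does it unconditionally.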
At this point, I have a system that is, as far as I can tell, definitionally a botnet. The only thing is, I actually pay for the hosts.
---
Scaling something like this up to high volume is a really interesting challenge. My hosts are physically distributed, and just maintaining the RabbitMQ socket links is hard. I've actually had to do some hacking on the RabbitMQ library to let it handle the various ways I've seen a socket get wedged, and I still have some reliability issues in the SaltStack-DigitalOcean interface where VM creation gets stuck in an infinite loop, leading to me bleeding all my hosts. I also had to implement my own message fragmentation on top of RabbitMQ, because literally no AMQP library I found could reliably handle large (>100 KB) messages without eventually wedging.
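Application-level fragmentation on top of a message queue can be sketched like this. The frame format (JSON with a message id, index, and total count) is my own invention for illustration, not the thread author's actual wire format; the idea is just to split a large payload into chunks small enough that the AMQP library never sees an oversized message, and reassemble on the consumer side.

```python
import base64
import json
import math
import uuid

CHUNK_SIZE = 64 * 1024  # stay well below the sizes where libraries wedge

def fragment(payload: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a payload into self-describing JSON frames for the queue."""
    msg_id = uuid.uuid4().hex
    total = max(1, math.ceil(len(payload) / chunk_size))
    for idx in range(total):
        chunk = payload[idx * chunk_size:(idx + 1) * chunk_size]
        yield json.dumps({
            "id": msg_id,
            "index": idx,
            "total": total,
            "data": base64.b64encode(chunk).decode("ascii"),
        })

def reassemble(frames):
    """Collect frames (in any order) and return the payload once complete."""
    buffers = {}
    for frame in frames:
        meta = json.loads(frame)
        parts = buffers.setdefault(meta["id"], {})
        parts[meta["index"]] = base64.b64decode(meta["data"])
        if len(parts) == meta["total"]:
            return b"".join(parts[i] for i in range(meta["total"]))
    return None  # incomplete
```

Because frames carry their own index and total, they can arrive out of order or interleaved with frames of other messages, which is exactly what happens with multiple producers on one queue.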
There are other fun problems too, like the fact that I have a Postgres database that's ~700 GB in size, which means you have to spend time considering your DB design and doing query optimization too. I apparently have big data problems in my bedroom (my home servers are in my bedroom closet).
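At that size, the single most common optimization check is whether a hot query actually hits an index instead of scanning the table. A tiny illustration of that workflow (SQLite stands in for Postgres here, and the schema is hypothetical; on Postgres you'd use `EXPLAIN` the same way):

```python
import sqlite3

# Hypothetical snapshot table; check the query plan before and after
# adding an index on the lookup column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshots (url TEXT, fetched_at TEXT, body TEXT)")

plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM snapshots WHERE url = ?", ("x",)
).fetchall()

conn.execute("CREATE INDEX idx_snapshots_url ON snapshots(url)")

plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM snapshots WHERE url = ?", ("x",)
).fetchall()
# plan_before reports a full table SCAN; plan_after reports a SEARCH
# using idx_snapshots_url.
```

It's a toy example, but the habit scales: on a 700 GB database, the difference between those two plans is the difference between milliseconds and minutes.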
At one point, a friend and I were looking at trying to basically replicate the google deep-dream neural net thing, only with a training set of porn. It turns out getting a well tagged dataset for training is somewhat challenging.
Well-tagged hentai is trivially accessible, though. I think there's probably a paper or two in there about the demographics of the two fan groups. People are fascinating.
I'm not scraping high-value sites like that (I mostly target amateur original content). It's not really of interest to other businesses. As such, I tend to just run into things like normal Cloudflare-wrapped sites, and one place that tried to detect bots and return intentionally garbled data.
If I run into that sort of thing, I guess we'll see.
I've never used this, and it's incredibly shady considering the users probably don't realize their Hola browser plugin does this, but Hola runs a paid VPN service where you can get thousands of low-bandwidth connections on unique residential IP addresses, provided generously by their "free" VPN users... It's essentially a legitimized attempt at running a botnet as a service.
I'm sure there'd be a ton of people who would love to pay to use your platform (who cares if the source is available; I don't want to run my own, because once the code is written, it's the ops that's hard). But then I suppose it would be hard to stay unnoticed.
Yeah, running this thing publicly would be a huge mess from a copyright perspective, since it literally re-hosts everything as a core part of how it works.
As it is, I think I'm OK, since it's basically just a "website DVR" type thing, for my own use.
Really, if nothing else, the project has been enormously educational for me. I've learned a boatload about distributed systems, picked up a bit of SQL, dicked about with databases a bunch, and actually experienced deploying a complex multi-component application across multiple disparate data centers.
This project is really cool. Last year I was looking into open source projects that implement something like Readability so that I could scrape articles from my RSS feeds and turn them into plaintext. But I didn't find anything that blew me away. The best I got was stealing the implementation from Firefox, and I lost interest before I could make it worthwhile. (Now revisiting the idea, I wonder why I never thought of passing a user-agent from a mobile browser... Probably would have helped a lot.)
I see you don't have a license listed on GitHub. Do you have a license in mind for these?
It's probably GPL; I'll have to figure out my dependencies and see what it's infected with. I tend to err toward BSD for my own cruft.
This isn't quite as fancy as readability, though I integrated a port of readability for a while. Now I just write a ruleset for a site that has stuff that interests me.
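A per-site ruleset can be as simple as a dict of patterns keyed by site. This sketch is purely illustrative (the real rulesets and their structure aren't shown in the thread, and a production version would use a real HTML parser rather than regexes):

```python
import re

# Hypothetical pluggable rulesets: each site maps to a pattern that
# selects the content element plus patterns for cruft to strip first.
RULESETS = {
    "example.com": {
        "content": re.compile(r"<article>(.*?)</article>", re.S),
        "strip": [re.compile(r"<aside>.*?</aside>", re.S)],
    },
}

def extract(site, html):
    """Apply a site's ruleset to raw HTML; None if no ruleset matches."""
    rules = RULESETS.get(site)
    if rules is None:
        return None
    for pattern in rules["strip"]:
        html = pattern.sub("", html)
    match = rules["content"].search(html)
    return match.group(1).strip() if match else None
```

Adding support for a new site is then just registering another entry in `RULESETS`, which is the appeal of the pluggable approach over a one-size-fits-all readability heuristic.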
---
It's all on github, FWIW:
Manager: https://github.com/fake-name/ReadableWebProxy
Agent and salt scheduler: https://github.com/fake-name/AutoTriever