There are already 6-8 major scrapers that do this constantly, across the whole internet, called search engines. You can't handle that?
What if you get a normal user who says "Hey, I wanna see some of the lesser known authors on this platform" and opens up a hundred tabs with rarely-read blogs? What if you get 10 users who decide to do that on the same day? Is it reasonable to sue them? Should there be a legal protection to punish them for making your site slow?
Don't blame the user for your scaling issues. If the optimized browser ("scraper") isn't hammering your site at a massively unnatural interval, it's clean. And if it is, you should have server-side controls that prevent one client from asking for too much data.
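For what it's worth, the server-side control I have in mind can be as simple as a per-client token bucket. A minimal sketch in Python (the rate and capacity numbers are made up for illustration, and a real deployment would key buckets by client IP or API token):

```python
import time

class TokenBucket:
    """Toy per-client rate limiter: refills `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full so a short burst is allowed
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; otherwise the request should be throttled."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client: a burst of 3 requests passes, the 4th is throttled
bucket = TokenBucket(rate=1, capacity=3)
results = [bucket.allow() for _ in range(4)]
```

A scheme like this punishes only the client that is actually hammering you, while the hundred-tabs reader sails through.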
These are just normal problems that are part of being on the web. It's not fair to pin it on non-malicious users, even if they're not using a conventional desktop browser.
First, search engines are scrapers. No need to make a distinction.
Second, search engines don't always respect robots.txt. Even Google itself says it may still contact a page that has disallowed it. [0]
Third, robots.txt is just a convention. There's no reason to assume it has any binding authority. Users should be able to access public HTTP resources with any non-disruptive HTTP client, regardless of the end server's opinion.
[0] "You should not use robots.txt as a means to hide your web pages from Google Search results. This is because other pages might point to your page, and your page could get indexed that way, avoiding the robots.txt file." / http://archive.is/A5zh8
In the Google quote you link to, Google is not contacting your page. Rather, Google will index pages that are only linked to, which it has never crawled, and will serve up those pages if the link text matches your query. That's how you get those search results where the snippet is "A description of this page has been blocked by robots.txt" or similar.
There's a somewhat related issue: to ensure your site never appears in Google at all, you actually need to allow it to be crawled, because the standard mechanism for that is a "<meta name=robots content=noindex>" tag, and in order to see the meta noindex, the search engine has to fetch the page.
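To illustrate the mechanics: the directive lives in the page body, so a crawler can only honor it after fetching the page. A toy check in Python's standard library (this is a sketch, not any real search engine's code):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if (d.get("name") or "").lower() == "robots":
                content = d.get("content") or ""
                self.directives += [t.strip().lower() for t in content.split(",")]

def has_noindex(html: str) -> bool:
    """True if the fetched page body asks not to be indexed."""
    p = RobotsMetaParser()
    p.feed(html)
    return "noindex" in p.directives

# The crawler has to download this body before it can learn it shouldn't index it:
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
```

Hence the paradox: blocking the crawl in robots.txt also blocks the crawler from ever seeing the noindex.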
And the original point of my comment was that doing this is extremely rude and not appropriate, not that it couldn't be done or that others weren't doing it.
Feel free to send any request to any server you want; it is certainly up to them to decide whether or not to serve it. But that doesn't absolve you of guilt for scraping someone's site when they explicitly ask you not to.
Please don't conflate "extremely rude", "not appropriate", and "guilt". Two of these are subjective opinions about what constitutes good citizenship. The last one is a legal determination that has the potential to deprive an individual of both his money and liberty. We're discussing whether these behaviors should be legal, not whether they are necessarily polite.
You are posting in a comment thread underneath my reply about rudeness and impoliteness, and are, somewhat ironically, being rude yourself: you're telling me off about a conflation I never made.
We do, and we also use our own user-agent string: "SiteTruth.com site rating system". A growing number of sites reject connections based on USER-AGENT string. Try "redfin.com", for example. (We list those as "blocked"). Some sites won't let us read the "robots.txt" file. In some cases, the site's USER-AGENT test forbids things the "robots.txt" allows.
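As an aside, you can check offline how the standard library's robots.txt matcher treats a long user-agent string like this one, with no network involved. The rules below are invented for illustration; only the user-agent string mirrors the one above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule for our bot, everything else allowed
ROBOTS_TXT = """\
User-agent: SiteTruth
Disallow: /ratings/

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

ua = "SiteTruth.com site rating system"
# The matcher does a substring match on the token before any "/",
# so the "SiteTruth" group applies to our full user-agent string.
blocked = not rp.can_fetch(ua, "http://example.com/ratings/page")
browser_ok = rp.can_fetch("Mozilla/5.0", "http://example.com/ratings/page")
```

Note this only models robots.txt; the USER-AGENT connection rejections described above happen at the HTTP layer, before robots.txt even enters the picture.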
Another issue is finding the site's preferred home page. We look at "example.com" and "www.example.com", both with HTTP and HTTPS, trying to find the entry point.
This just looks for redirects; it doesn't even read the content. Some sites have redirects from one of those four options to another one. In some cases, the less favored entry point has a "disallow all" robots.txt file. In some cases, the robots.txt file itself is redirected. This is like having doors with various combinations of "Keep Out" and "Please use other door" signs. In that phase, we ignore "robots.txt" but don't read any content beyond the HTTP header.
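The probe can be sketched as a pure function if you model the redirect responses as a dict instead of live HTTP HEAD requests (all names here are hypothetical, and a real implementation would also cap and log redirect loops):

```python
from collections import Counter

# The four candidate entry points we try for each domain
CANDIDATES = ["http://{d}", "http://www.{d}", "https://{d}", "https://www.{d}"]

def resolve(url: str, redirects: dict, limit: int = 10) -> str:
    """Follow a redirect chain until it settles or the hop limit is hit."""
    hops = 0
    while url in redirects and hops < limit:
        url = redirects[url]
        hops += 1
    return url

def preferred_home(domain: str, redirects: dict) -> str:
    """Pick the final URL that most of the four entry points converge on."""
    finals = [resolve(c.format(d=domain), redirects) for c in CANDIDATES]
    return Counter(finals).most_common(1)[0][0]

# Illustrative redirect map: everything funnels to https://www.example.com
redirects = {
    "http://example.com": "https://example.com",
    "https://example.com": "https://www.example.com",
    "http://www.example.com": "https://www.example.com",
}
home = preferred_home("example.com", redirects)
```

The "Keep Out" vs. "Please use other door" mess shows up here as disagreement among the four chains; converging on the majority target is one way to break the tie.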
Some sites treat the four reads to find the home page as a denial of service attack and refuse connections for about a minute.
Then there's Wix. Wix sometimes serves a completely different page if it thinks you're a bot.