There are already 6-8 major scrapers that do this constantly, across the whole internet, called search engines. You can't handle that?
What if you get a normal user who says "Hey, I wanna see some of the lesser known authors on this platform" and opens up a hundred tabs with rarely-read blogs? What if you get 10 users who decide to do that on the same day? Is it reasonable to sue them? Should there be a legal protection to punish them for making your site slow?
Don't blame the user for your scaling issues. If the optimized browser ("scraper") isn't hammering your site at a massively unnatural interval, it's clean. And if it is, you should have server-side controls that prevent one client from asking for too much data.
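For what it's worth, the server-side control I have in mind can be as simple as a per-client token bucket. A minimal sketch in Python (the rate and capacity numbers are made up for illustration, and a real deployment would key buckets by client IP or API token):

```python
import time

class TokenBucket:
    """Toy per-client rate limiter: refills `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full so a short burst is allowed
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; otherwise the request should be throttled."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per client: a burst of 3 requests passes, the 4th is throttled
bucket = TokenBucket(rate=1, capacity=3)
results = [bucket.allow() for _ in range(4)]
```

A scheme like this punishes only the client that is actually hammering you, while the hundred-tabs reader sails through.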
These are just normal problems that are part of being on the web. It's not fair to pin it on non-malicious users, even if they're not using a conventional desktop browser.
First, search engines are scrapers. No need to make a distinction.
Second, search engines don't always respect robots.txt. Even Google itself says it may still contact a page that has disallowed it. [0]
Third, robots.txt is just a convention. There's no reason to assume it has any binding authority. Users should be able to access public HTTP resources with any non-disruptive HTTP client, regardless of the end server's opinion.
[0] "You should not use robots.txt as a means to hide your web pages from Google Search results. This is because other pages might point to your page, and your page could get indexed that way, avoiding the robots.txt file." / http://archive.is/A5zh8
In the Google quote you link to, Google is not contacting your page. Rather, Google will index pages that are only linked to, which it has never crawled, and will serve up those pages if the link text matches your query. That's how you get those search results where the snippet is "A description of this page has been blocked by robots.txt" or similar.
There's a somewhat related issue: to ensure your site never appears in Google at all, you actually need to allow it to be crawled, because the standard mechanism for that is a "<meta name=robots content=noindex>" tag, and in order to see the meta noindex, the search engine has to fetch the page.
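To illustrate the mechanics: the directive lives in the page body, so a crawler can only honor it after fetching the page. A toy check in Python's standard library (this is a sketch, not any real search engine's code):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if (d.get("name") or "").lower() == "robots":
                content = d.get("content") or ""
                self.directives += [t.strip().lower() for t in content.split(",")]

def has_noindex(html: str) -> bool:
    """True if the fetched page body asks not to be indexed."""
    p = RobotsMetaParser()
    p.feed(html)
    return "noindex" in p.directives

# The crawler has to download this body before it can learn it shouldn't index it:
page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
```

Hence the paradox: blocking the crawl in robots.txt also blocks the crawler from ever seeing the noindex.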
And the original point of my comment was that doing this is extremely rude and not appropriate, not that it couldn't be done or that others weren't doing it.
Feel free to send any request to any server you want; it is certainly up to them to decide whether or not to serve it. But that doesn't absolve you of guilt for scraping someone's site when they explicitly ask you not to.
Please don't conflate "extremely rude", "not appropriate", and "guilt". Two of these are subjective opinions about what constitutes good citizenship. The last one is a legal determination that has the potential to deprive an individual of both his money and liberty. We're discussing whether these behaviors should be legal, not whether they are necessarily polite.
You are posting in a comment thread underneath my reply about rudeness and impoliteness, and are, somewhat ironically, being rude yourself: you're telling me off about a conflation I never made.
We do, and we also use our own user-agent string: "SiteTruth.com site rating system". A growing number of sites reject connections based on USER-AGENT string. Try "redfin.com", for example. (We list those as "blocked"). Some sites won't let us read the "robots.txt" file. In some cases, the site's USER-AGENT test forbids things the "robots.txt" allows.
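As an aside, you can check offline how the standard library's robots.txt matcher treats a long user-agent string like this one, with no network involved. The rules below are invented for illustration; only the user-agent string mirrors the one above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule for our bot, everything else allowed
ROBOTS_TXT = """\
User-agent: SiteTruth
Disallow: /ratings/

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

ua = "SiteTruth.com site rating system"
# The matcher does a substring match on the token before any "/",
# so the "SiteTruth" group applies to our full user-agent string.
blocked = not rp.can_fetch(ua, "http://example.com/ratings/page")
browser_ok = rp.can_fetch("Mozilla/5.0", "http://example.com/ratings/page")
```

Note this only models robots.txt; the USER-AGENT connection rejections described above happen at the HTTP layer, before robots.txt even enters the picture.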
Another issue is finding the site's preferred home page. We look at "example.com" and "www.example.com", both with HTTP and HTTPS, trying to find the entry point.
This just looks for redirects; it doesn't even read the content. Some sites have redirects from one of those four options to another one. In some cases, the less favored entry point has a "disallow all" robots.txt file. In some cases, the robots.txt file itself is redirected. This is like having doors with various combinations of "Keep Out" and "Please use other door" signs. In that phase, we ignore "robots.txt" but don't read any content beyond the HTTP header.
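The probe can be sketched as a pure function if you model the redirect responses as a dict instead of live HTTP HEAD requests (all names here are hypothetical, and a real implementation would also cap and log redirect loops):

```python
from collections import Counter

# The four candidate entry points we try for each domain
CANDIDATES = ["http://{d}", "http://www.{d}", "https://{d}", "https://www.{d}"]

def resolve(url: str, redirects: dict, limit: int = 10) -> str:
    """Follow a redirect chain until it settles or the hop limit is hit."""
    hops = 0
    while url in redirects and hops < limit:
        url = redirects[url]
        hops += 1
    return url

def preferred_home(domain: str, redirects: dict) -> str:
    """Pick the final URL that most of the four entry points converge on."""
    finals = [resolve(c.format(d=domain), redirects) for c in CANDIDATES]
    return Counter(finals).most_common(1)[0][0]

# Illustrative redirect map: everything funnels to https://www.example.com
redirects = {
    "http://example.com": "https://example.com",
    "https://example.com": "https://www.example.com",
    "http://www.example.com": "https://www.example.com",
}
home = preferred_home("example.com", redirects)
```

The "Keep Out" vs. "Please use other door" mess shows up here as disagreement among the four chains; converging on the majority target is one way to break the tie.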
Some sites treat the four reads to find the home page as a denial of service attack and refuse connections for about a minute.
Then there's Wix. Wix sometimes serves a completely different page if it thinks you're a bot.