I see two issues. The first is that web scraping is largely a business model problem.
If you give away your content for free and expect ads to sustain you, that will start failing once others get the value out of your content without seeing the ads. Examples are ad blockers, answers embedded in Google results, Stack Overflow clones, and things like ChatGPT.
If ads weren’t your business model, you wouldn’t be losing revenue to this.
The other issue is scale, and I don’t know how to address it.
It’s easy for someone (say the government) to have a friendly policy and say “you can dig in a park”, thinking it’s useful to campers and such.
But when someone shows up with a professional strip mining crew, things are different.
If you run a site providing quality information for free, making money off book sales or professional services or such can be a good living. Even if answers end up in the Google answer box, more complicated stuff or analysis still requires a visit to read and people can start following you from there.
But if ChatGPT or whatever can “read” your stuff and give out 80% of the value without anyone even knowing it came from you, you’re screwed. Your business model no longer works. Any kind of “give away good information” business model fails. Same issue artists are now seeing.
And I don’t know how you fix that without some kind of ban. But unless every country everywhere enforces one… you have to work with the lowest common denominator and lock all your content up. No web search. No Google answers. No ChatGPT. “Please don’t scrape me” in robots.txt won’t work.
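That last point is mechanical, not just legal: robots.txt is purely advisory. A minimal Python sketch (the site URL and bot names are hypothetical) shows that honoring it is entirely up to the client:

```python
from urllib import robotparser

# A hypothetical robots.txt asking all crawlers to stay away entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks before fetching and gets told "no"...
print(rp.can_fetch("PoliteBot", "https://example.com/articles/1"))

# ...but nothing in the protocol prevents an impolite crawler from
# fetching the page anyway; the file is a request, not an enforcement.
```

In other words, robots.txt only ever constrained scrapers that chose to cooperate.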
It's interesting, because it's essentially the exact same discussion as traditional copyright (e.g. for books). The only difference is that book authors generally aren't giving away their books for free on their personal websites. Copyright is an attempt to protect the business model of authors who want to sell copies of something that is otherwise extremely easy and cheap to copy. Attempts to legally limit web scraping are an attempt to protect the business model of creators who want to give away, for free, copies of things that are easy and cheap to copy, but only if we come directly to the creator to get our free copy.
There's a nuanced difference between this new wave of AI scraping and the old "copying a site" scraping, and it's not a copyright issue.
The original problem with copyright was that the website owner's content could be duplicated elsewhere and thus violate copyright (as well as sucking away web traffic and presumably lowering revenue).
The new AI issue is not that the content is duplicated elsewhere, but that the knowledge contained in the content is "learnt" and used to produce a different work (totally copyright-free in the truest sense, as it is original). An example would be a recipe website. The site owner could've painstakingly collected recipes from the literature, then cataloged and labelled them, making search easy. But the recipes themselves are not copyrightable, only the expression of the recipe.
So given this info, the AI scraper now has a large labelled dataset to learn from and to generate new recipes with. These new recipes do not violate _any_ existing copyright, as they are entirely original in expression.
I say, as an AI advocate, that the old business model of recipe hosting is destroyed by this new AI, and trying to preserve it by legal means is just fighting against the tide. After all, the world doesn't have a unified jurisdiction, and the internet is worldwide, so any would-be violators could just as easily move to a different country to operate.
You’re right. That’s why scraping must be unlimited and legal for all. Any information accessible from the internet should be legal to refine. That includes us using GPT services to train our own models, and scraping anything that’s publicly accessible. Our only defense is competing services that refine the data even further than any general LLM. The solution is almost never regulation but competition. Fair competition.
I don't think that's true. "Right to erasure" still works just as well as it always has, but you might need to ask the folks who have scraped and are re-sharing your information to also delete your personal data. That's not an unreasonable thing to have happen, nor is it an unreasonable thing to expect.
Let's suppose an embarrassing image of Person X is shared on Facebook and Person X uses their right to erasure with Facebook to delete their profile. Facebook has no control over the folks who may have downloaded or screenshotted that photo and turned it into subsequent memes. Likewise, if someone straight up scrapes and re-shares, that's not Facebook's responsibility.
What I don't want to see happen is for:
1. Facebook to make it somehow impossible for anyone to ever copy or screenshot that or any photo, preventing anyone from ever doing anything with photos on Facebook without Facebook's explicit permission. This would seem to be quite the loss of user agency for very little society-wide benefit (also, how would they do this?)
2. Facebook to somehow "control" that photo so closely that Facebook is able to remotely revoke folks' copies and screenshots of said photo in the spirit of "abiding by a person's right to erasure"; that'd be a huge overreach, but seems like the only other way to approach this (though "how" is also an open question).
Even asserting that "unlimited scraping makes some privacy regulations moot" seems like an implication that we can only have privacy laws by going towards situation #1, and that doesn't seem accurate given that folks can use existing privacy laws to remove content from any distributor (as long as they're compliant).
Not exactly. You can request a site to erase all the data it has on you, but not that they erase the memories of everyone who has seen this data. How is this any different?
Your tone implies you're serious, but I struggle to believe anyone could possibly equate persisting digital media with recalling a memory.
In case you really need an example to elucidate, consider reproducing an image. A scraper can quite literally accomplish that, trivially; a great artist would still be limited in multiple facets of the recreation, such that even one with the best memory and hand would find themselves far short of pixel-perfect.
I wonder how we would regard a person who could reliably perform such a feat whenever he pleased. Would we sterilize him, lest he give rise to a bunch of cute little privacy-invading monsters?
If the feat you mean is to perfectly recall disparaging information they see about people on web sites, we already have people with quite good memories. Irrelevance usually keeps them from bringing up the details of strangers' lives on a regular basis. If the juicy details are about friends or acquaintances, well, it's very easy to destroy one's social position - at least, with non-toxic people - by endlessly and tiresomely discussing other people's misfortunes or mistakes.
How many of them saved it and then reuploaded it elsewhere? Sorry, but talking about protecting the privacy of people who upload things for anyone to see just seems silly to me.
So at which scale does the copying of data lower privacy, such that humans looking at it and potentially screenshotting it doesn't, but automated processes copying it does?
No, but since we are talking about laws, it is important to define the point beyond which a kind of behavior becomes unacceptable, or at least some set of criteria to determine when a specific instance is beyond that point.
> If you give away your content for free and expect ads to sustain you, that will start failing once others get the value out of your content without seeing the ads
I don't think a paywall would fix this. One paid account is all a scraper needs. It couldn't really even be rate-limited if it's just "reading" articles as they become available. After the data is acquired it can be dispensed. If directly posting it violates copyright, then obscuring it behind AI will do the trick just fine.
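For concreteness, here's a minimal sketch of the per-account rate limiting being discussed, as a token bucket (all names and numbers are illustrative). It also shows the evasion described above: a scraper that reads no faster than the bucket refills never trips the limit at all.

```python
import time

class TokenBucket:
    """Per-account limiter: allow at most `rate` reads per second,
    with bursts of up to `capacity` reads."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=5)
results = [bucket.allow() for _ in range(10)]
# The initial burst of `capacity` reads succeeds; rapid reads beyond
# it are throttled. But a scraper reading one article per second, as
# articles appear, stays under the refill rate and is never blocked.
print(results)
```

That last comment is the commenter's point: per-account throttling only stops scrapers that read faster than a patient human would.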
But it stops being trivial. Now, to scrape websites en masse, you have to automate signing up for them, and probably pay for them.
And unlike now, to sign up you have to agree to an actually enforceable EULA.
So instead of going to court with “FunAI read my public website and is making money off it which I don’t think that should be fair use”, you have “FunAI violated a contract they signed and committed fraud by lying on signup”.
Seems to me that’s much easier.
There will always be people who get the content for free somehow. You don’t have to stop 100%. Even stopping 95% would be a lot better than the current 0%.