I see two issues. The first is that web scraping is largely a business model problem.
If you give away your content for free and expect ads to sustain you, that will start failing once others get the value out of your content without seeing the ads. Examples are ad blockers, answers embedded in Google results, Stack Overflow clones, and things like ChatGPT.
If ads weren’t your business model, you wouldn’t be losing revenue to this.
The other issue is scale, and I don’t know how to address it.
It’s easy for someone (say the government) to have a friendly policy and say “you can dig in a park”, thinking it’s useful to campers and such.
But when someone shows up with a professional strip mining crew, things are different.
If you run a site providing quality information for free, making money off book sales or professional services or such can be a good living. Even if answers end up in the Google answer box, more complicated stuff or analysis still requires a visit to read and people can start following you from there.
But if ChatGPT or whatever can “read” your stuff and give out 80% of the value without anyone even knowing it came from you, you’re screwed. Your business model no longer works. Any kind of “give away good information” business model fails. Same issue artists are now seeing.
And I don’t know how you fix that without some kind of ban. But unless every country everywhere enforces one… you have to work with the lowest common denominator and lock all your content up. No web search. No Google answers. No ChatGPT. “Please don’t scrape me” in robots.txt won’t work.
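That last point is mechanical, not just legal: robots.txt is purely advisory. A minimal Python sketch (the site URL and bot names are hypothetical) shows that honoring it is entirely up to the client:

```python
from urllib import robotparser

# A hypothetical robots.txt asking all crawlers to stay away entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks before fetching and gets told "no"...
print(rp.can_fetch("PoliteBot", "https://example.com/articles/1"))

# ...but nothing in the protocol prevents an impolite crawler from
# fetching the page anyway; the file is a request, not an enforcement.
```

In other words, robots.txt only ever constrained scrapers that chose to cooperate.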
It's interesting, because it's essentially the exact same discussion as traditional copyright (e.g. for books). The only difference is that book authors generally aren't giving away their books for free on their personal websites. Copyright is an attempt to protect the business model of authors who want to sell copies of something that is otherwise extremely easy and cheap to copy. Attempts to legally limit web scraping are an attempt to protect the business model of creators who want to give away, for free, copies of things that are easy and cheap to copy, but only if we come directly to the creator to get our free copy.
There's a nuanced difference between this new wave of AI scraping and the old "copying a site" scraping, and it's not a copyright issue.
The original problem with copyright was that the website owner's content could be duplicated elsewhere and thus violate copyright (as well as sucking away web traffic and presumably lowering revenue).
The new AI issue is not that the content is duplicated elsewhere, but that the knowledge contained in the content is "learnt" and used to produce a different work (totally copyright-free in the truest sense, as it is original). An example would be a recipe website. The site owner could've painstakingly collected recipes from the literature, then cataloged and labelled them, making search easy. But the recipes themselves are not copyrightable, only the expression of the recipe.
So given this info, the AI scraper now has a large labelled dataset to learn from and to generate new recipes with. These new recipes do not violate _any_ existing copyright, as they are entirely original in expression.
I say, as an AI advocate, that the old business model of recipe hosting is destroyed by this new AI, and trying to preserve it by legal means is just fighting against the tide. After all, the world doesn't have a unified jurisdiction, and the internet is worldwide, so any would-be violators could just as easily move to a different country to operate.
You’re right. That’s why scraping must be unlimited and legal for all. Any information accessible from the internet should be legal to refine. That includes us using GPT services to train our own models, and scraping anything that’s publicly accessible. Our only defense is competing services that refine the data even further than any general LLM. The solution is almost never regulation but competition. Fair competition.
I don't think that's true. "Right to erasure" still works just as well as it always has, but you might need to ask the folks who have scraped and are re-sharing your information to also delete your personal data. That's not an unreasonable thing to have happen, nor is it an unreasonable thing to expect.
Let's suppose an embarrassing image of Person X is shared on Facebook and Person X uses their right to erasure with Facebook to delete their profile. Facebook has no control over the folks who may have downloaded or screenshotted that photo and turned it into subsequent memes. Likewise, if someone straight up scrapes and re-shares, that's not Facebook's responsibility.
What I don't want to see happen is for:
1. Facebook to make it somehow impossible for anyone to ever copy or screenshot that or any photo, preventing anyone from ever doing anything with photos on Facebook without Facebook's explicit permission. This would seem to be quite the loss of user agency for very little society-wide benefit (also, how would they do this?)
2. Facebook to somehow "control" that photo so closely that Facebook is able to remotely revoke folks' copies and screenshots of said photo in the spirit of "abiding by a person's right to erasure"; that'd be a huge overreach, but seems like the only other way to approach this (though "how" is also an open question).
Even asserting that "unlimited scraping makes some privacy regulations moot" seems like an implication that we can only have privacy laws by going towards situation #1, and that doesn't seem accurate given that folks can use existing privacy laws to remove content from any distributor (as long as they're compliant).
Not exactly. You can request a site to erase all the data it has on you, but not that they erase the memories of everyone who has seen this data. How is this any different?
Your tone implies you're serious, but I struggle to believe anyone could possibly equate persisting digital media with recalling a memory.
In case you really need an example to elucidate, consider reproducing an image. A scraper can quite literally accomplish that, trivially; a great artist would still be limited in multiple facets of the recreation, such that even one with the best memory and hand would find themselves far short of pixel-perfect.
I wonder how we would regard a person who could reliably perform such a feat whenever he pleased. Would we sterilize him, lest he give rise to a bunch of cute little privacy-invading monsters?
If the feat you mean is to perfectly recall disparaging information they see about people on web sites, we already have people with quite good memories. Irrelevance usually keeps them from bringing up the details of strangers' lives on a regular basis. If the juicy details are about friends or acquaintances, well, it's very easy to destroy one's social position - at least, with non-toxic people - by endlessly and tiresomely discussing other people's misfortunes or mistakes.
How many of them saved it and then reuploaded it elsewhere? Sorry, but talking about protecting the privacy of people who upload things for anyone to see just seems silly to me.
So at which scale does the copying of data lower privacy, such that humans looking at it and potentially screenshotting it doesn't, but automated processes copying it does?
No, but since we are talking about laws, it is important to define the point beyond which a kind of behavior becomes unacceptable, or at least some set of criteria to determine when a specific instance is beyond that point.
> If you give away your content for free and expect ads to sustain you, that will start failing once others get the value out of your content without seeing the ads
I don't think a paywall would fix this. One paid account is all a scraper needs. It couldn't really even be rate-limited if it's just "reading" articles as they become available. After the data is acquired it can be dispensed. If directly posting it violates copyright, then obscuring it behind AI will do the trick just fine.
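For concreteness, here's a minimal sketch of the per-account rate limiting being discussed, as a token bucket (all names and numbers are illustrative). It also shows the evasion described above: a scraper that reads no faster than the bucket refills never trips the limit at all.

```python
import time

class TokenBucket:
    """Per-account limiter: allow at most `rate` reads per second,
    with bursts of up to `capacity` reads."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=5)
results = [bucket.allow() for _ in range(10)]
# The initial burst of `capacity` reads succeeds; rapid reads beyond
# it are throttled. But a scraper reading one article per second, as
# articles appear, stays under the refill rate and is never blocked.
print(results)
```

That last comment is the commenter's point: per-account throttling only stops scrapers that read faster than a patient human would.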
But it stops being trivial. Now, to scrape websites en masse, you have to automate signing up for them, and probably pay for them.
And unlike now, to sign up you have to agree to an actually enforceable EULA.
So instead of going to court with “FunAI read my public website and is making money off it which I don’t think that should be fair use”, you have “FunAI violated a contract they signed and committed fraud by lying on signup”.
Seems to me that’s much easier.
There will always be people who get the content for free somehow. You don’t have to stop 100%. Even stopping 95% would be a lot better than the current 0%.