
> the additional layers of indirection sometimes produce novel output, and many times do not.

I think this is the key insight. It differs from something like, say, JPEG (de)compression in that it also produces novel but sensible combinations of copyrighted and non-copyrighted data, independent of their original context. In fact, I'd argue that is its main purpose. Describing it as just a lossy, natural-language-queryable compressed database would therefore be reductive and a mischaracterization of its function. Yes, it can recall extended segments of its training data, as the paper demonstrates, but it also cannot plagiarize a given source work in its entirety, as the paper also describes.

> why shouldn't LLM services trained on copyrighted material be held responsible when their product violates "substantial similarity" of said copyrighted works?

Because these companies and services are not producing the substantially similar output on their own; they (possibly) do so in response to user input. You could make a case that they should perform filtering and detection, but I'm not sure filtering is a good idea, since the user might well have the rights to create a work substantially similar to something copyrighted, for instance when they own the rights or hold a license to that thing. At that point, you can only hold the user themselves responsible. Detection on its own might be reasonable to require, though, to give the user the ability to avoid infringing, should that indeed not be their goal. This is a lot like famous-people detection and filtering, which I'm sure tech reviewers have to battle from time to time.
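
To make the detection idea a bit more concrete, here is a minimal sketch of what I have in mind, assuming a toy reference corpus and a crude n-gram overlap measure; the function names, the corpus, and the threshold are all made up for illustration, and a real system would need something far more robust:

    # Flag generated text that shares long verbatim spans with a reference
    # corpus, and report the matches rather than block them outright.
    def ngrams(text: str, n: int = 8):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def similarity_report(output, corpus, n=8, threshold=0.2):
        """Return (source_id, overlap) pairs whose n-gram overlap exceeds the threshold."""
        out_grams = ngrams(output, n)
        if not out_grams:
            return []
        hits = []
        for source_id, source_text in corpus.items():
            overlap = len(out_grams & ngrams(source_text, n)) / len(out_grams)
            if overlap >= threshold:
                hits.append((source_id, round(overlap, 2)))
        return sorted(hits, key=lambda h: -h[1])

The report could then be surfaced alongside the completion, leaving the decision about what to do with it to the user.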

This isn't to say they shouldn't be held responsible for pirating the copyrighted content in the first place, though. And if they performed automated generation of substantially similar content, that would still be problematic under this logic. I'm not thinking of chain-of-thought here, mind you, but of something sillier, like writing a harness that scrapes sentiment and reactively generates things based on it, or that uses, I don't know, the weather or the current time plus their own prompts as the trigger.

Let me give you a possibly terrible example. Should Blizzard be held accountable in Germany when users there, on servers located there, arrange themselves into the shape of a Nazi swastika in-game and then publish screenshots and screen recordings of it on the internet? I don't think so. User action played the crucial role in the reproduction of the hate symbol in question. Likewise, LLMs aren't just spouting off whatever; they're prompted. The researchers in the paper had to put in focused effort to perform extraction. Despite the popular characterization, these are not copycat machines, and they're not just pulling all their answers out of a magic basket because we all ask obvious things that have been answered before on the internet. Maybe if the aforementioned detection were added, people would finally stop mischaracterizing them this way.



One runs the risk of being reductive when examining a mechanism's irreducible parts.

User expression is a beast unto itself, but I wonder whether that alone absolves the service provider. I imagine Blizzard has an extensive and mature moderation apparatus to police and discourage such behavior, so there's an acceptable level of justice and accountability in place. Yet there are even more terrible real-life examples of illicit behavior outpacing moderation and overrunning platforms to the point of legal intervention and termination. Moderating user behavior is one thing, but how do you propose moderating AI expression?

A digression from copyright: portraying models as a "blank canvas" is itself a poor characterization. Output might be triggered by a prompt, like a query against a database, but it's ultimately a reflection of the contents of the training data. I think we could agree that a model trained on the worst possible data you can imagine is something we don't need in the world, no matter how well-behaved your prompting is.


I do not propose moderating "AI expression" - I explicitly propose otherwise, and further propose mandating that the user be provided with source attribution information, so that they can choose not to infringe, should they be at risk of doing so and should they find that a concern (or even choose to acquire a license instead). Whether this is technologically feasible, I'm not sure, but it very much feels to me like it should be.
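
As for what "source attribution information" could mean in practice, here is a rough, purely hypothetical sketch of the shape such a response might take, building on the overlap report above; the field names and license hints are assumptions, not anything any provider actually exposes:

    from dataclasses import dataclass, field

    @dataclass
    class Attribution:
        source_id: str     # identifier of the matched work in the provider's index
        overlap: float     # fraction of the output's n-grams found in that work
        license_hint: str  # e.g. "all rights reserved", "CC-BY-4.0", "unknown"

    @dataclass
    class AnnotatedCompletion:
        text: str
        attributions: list[Attribution] = field(default_factory=list)

    def annotate(text, report):
        # Wrap a completion with attribution info instead of filtering it.
        return AnnotatedCompletion(
            text=text,
            attributions=[Attribution(sid, ov, "unknown") for sid, ov in report],
        )

The point is only that the user sees what matched and can decide whether they need a license, not that this is how it should be built.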

> A digression from copyright: portraying models as a "blank canvas" is itself a poor characterization. Output might be triggered by a prompt, like a query against a database, but it's ultimately a reflection of the contents of the training data.

I'm not sure how to respond to this, if at all; I think I addressed how I characterize the functionality of these models in sufficient detail. This just reads to me like an "I disagree" - and that's fine, but then that's also kind of it. We disagree, and that's okay.



