I wonder, however, if this paper might imply the answer.
"But the results are complicated: the extent of memorization varies both by model and by book. With our specific experiments, we find that the largest LLMs don't memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B memorizes some books, like Harry Potter and 1984, almost entirely."
I wonder if we could exclude the full text of these books from the training data and still approximate this result? Harry Potter and 1984 are probably some of the most quoted texts on the internet.
>Unless you advocate for discarding the whole regime of intellectual property or you can argue for a better model of IP laws, the question stands: why shouldn't LLM services trained on copyrighted material be held responsible when their product violates "substantial similarity" of said copyrighted works? Why should failure to do so be immune from legal action?
I think you are on the right track, but for me personally it really depends on how difficult it was to produce the result. Like if you enter "spit out harry potter and the philosopher's stone" and it does, that's black and white. But if you have to torture the model with a repeated prompt that forces it to ignore its constraints, that's not exactly using the system as intended.
I just tried ChatGPT:
>I can’t provide the full text of Harry Potter, as it’s copyrighted material. However, I can summarize it, discuss specific scenes or characters, or help analyze the themes or writing style if that’s useful. Let me know what you're after.
For my money, as long as the AI companies treat the reproduction of copyrighted material as a failure state, the nature of the training data is irrelevant.
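To make "failure state" concrete, here is a toy output-side check; the 8-word window, the helper names, and the sample line are all assumptions for the sketch, not how any real product's filter works:

```python
# Toy illustration only: flag a response that shares any long verbatim
# word sequence with a protected text. The 8-word window and names are
# invented for this sketch.

def word_ngrams(text: str, n: int = 8) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_reproduction(response: str, protected: str, n: int = 8) -> bool:
    # Treat any shared n-word run as a failure state.
    return bool(word_ngrams(response, n) & word_ngrams(protected, n))

protected = ("Mr and Mrs Dursley of number four Privet Drive were proud "
             "to say that they were perfectly normal thank you very much")
response = "Mr and Mrs Dursley of number four Privet Drive were proud to say hello"
print(is_reproduction(response, protected))  # True: shares an 8-word run
```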
> I think you are on the right track, but for me personally it really depends on how difficult it was to produce the result. Like if you enter "spit out harry potter and the philosopher's stone" and it does, that's black and white. But if you have to torture the model with a repeated prompt that forces it to ignore its constraints, that's not exactly using the system as intended.
Let me offer a different perspective. An LLM that is trained on copyrighted material, has memorized (or lossily compressed) it, and then has some "safety" machinery that tries to avoid verbatim-ish outputs of copyrighted material is fundamentally not distinguishable from simply having a plaintext database of copyrighted material with machinery for "fuzzy" data extraction from said material.
Suppose a company stores the whole of Stack Exchange in plaintext, then implements a chat-like interface that fuzzy-matches on the question, extracts answers from the plaintext database, fuzzes the top-rated/accepted answers together, and outputs something that doesn't necessarily quote one distinct answer, but is pretty damn close.
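A toy version of that thought experiment, just to make the knob explicit (the corpus, names, and cutoff below are invented for illustration, not any real system):

```python
# Minimal sketch of the hypothetical: a "chat" interface that is really
# just fuzzy retrieval over a plaintext corpus.
import difflib

# Hypothetical plaintext "database": question -> stored answers.
CORPUS = {
    "how do i reverse a list in python": [
        "Use list.reverse() for in-place reversal.",
        "Or take a reversed copy with slicing: my_list[::-1].",
    ],
}

def answer(query: str, fuzziness: float = 0.6) -> str:
    # Fuzzy-match the query against stored questions...
    matches = difflib.get_close_matches(query.lower(), list(CORPUS),
                                        n=1, cutoff=fuzziness)
    if not matches:
        return "No close match found."
    # ...then "fuzz" the top answers together (here: naive concatenation).
    return " ".join(CORPUS[matches[0]])

print(answer("How do I reverse a list in Python?"))
```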
How much "fuzziness" is required for this to stop being a copyright violation? LLM advocates say that LLMs are "fuzzy enough" without clearly defining what "enough" means.
>Let me offer a different perspective. An LLM that is trained on copyrighted material, has memorized (or lossily compressed) it, and then has some "safety" machinery that tries to avoid verbatim-ish outputs of copyrighted material is fundamentally not distinguishable from simply having a plaintext database of copyrighted material with machinery for "fuzzy" data extraction from said material.
Right, so sort of like a search engine that caches thumbnails of copyrighted images to display quick search results? Something I have been using for years and have no issue with, where the legal arguments are framed more around where the links go, and how easy the search engine makes it for me to acquire the original image?
Would your argument be the same if it was a human? If a person memorizes a book verbatim, but uses safety/common sense not to transcribe the book for others because that would be copyright infringement, do we disallow him from using the memorized information whatsoever, because he could duplicate it?
I’m saying that it doesn’t matter what humans do; this machine isn’t a human.
There is no reason to believe that humans and machines should be the same under the law.
The clearest example of this is that in the US it’s already been decided that AI-generated art can’t be copyrighted, because it was made by a computer rather than a person. Same as for the monkey selfie.
"But the results are complicated: the extent of memorization varies both by model and by book. With our specific experiments, we find that the largest LLMs don't memorize most books -- either in whole or in part. However, we also find that Llama 3.1 70B memorizes some books, like Harry Potter and 1984, almost entirely."
I wonder if we could exclude the full text of these books from the training data and still approximate this result? Harry Potter and 1984 are probably some of the most quoted texts on the internet.
>Unless you advocate for discarding the whole regime of intellectual property or you can argue for a better model of IP laws, the question stands: why shouldn't LLM services trained on copyrighted material be held responsible when their product violates "substantial similarity" of said copyrighted works? Why should failure to do so be immune from legal action?
I think you are on the right track but for me personally it really depends on how difficult it was to produce the result. Like if you enter "spit out harry potter and the philosophers stone" and it does. Thats black and white. But if you are able to torture a repeated prompt that forces the model to ignore its constraints, thats not exactly using the system as intended.
I just tried ChatGPT:
>I can’t provide the full text of Harry Potter, as it’s copyrighted material. However, I can summarize it, discuss specific scenes or characters, or help analyze the themes or writing style if that’s useful. Let me know what you're after.
For my money, as long as the AI companies treat the reproduction of copyrighted material as a failure state, the nature of the training data is irrelevant.