AI training shouldn't erase authorship (justine.lol)
79 points by Tomte on Aug 23, 2024 | 24 comments


I think that in principle one could tag the training set with "source" tags and express each weight as a sum of subweights, one per source tag; during backpropagation, both the overall weight and the subweight for the training sample's source would be updated, and during inference linear operations would be applied to the subweights as well, while nonlinear operations would scale all subweights by the ratio of the nonlinear output to the pre-activation total.

This should in principle make it possible to determine how much each source influenced each output token of the LLM.

The problem is that this multiplies storage and compute time for tagged inference by the number of source tags, so it may be impractical to tag individual documents or authors, but it might be useful for very broad categories like "copyrighted" vs "non-copyrighted", "synthetic" vs "human-generated", "photo" vs "drawing" vs "rendering", year range of publication, etc.
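A rough sketch of how such tagged inference could look for a single linear layer followed by a ReLU (NumPy, and the TaggedLinear / tagged_relu names and source count, are purely illustrative; nothing like this exists as a library API):

    import numpy as np

    N_SOURCES = 3  # e.g. "copyrighted", "synthetic", "public domain"

    class TaggedLinear:
        """Linear layer whose weight is stored as a sum of per-source subweights."""
        def __init__(self, d_in, d_out, n_sources=N_SOURCES):
            # subweights[s] is source s's share of the full weight matrix
            self.subweights = np.random.randn(n_sources, d_out, d_in) * 0.01

        def forward(self, x):
            # Linear ops distribute over the sum, so each source's contribution
            # to the pre-activation can be tracked separately.
            contribs = np.einsum('soi,i->so', self.subweights, x)  # (n_sources, d_out)
            return contribs.sum(axis=0), contribs

    def tagged_relu(total, contribs, eps=1e-9):
        # The nonlinearity is applied to the total; each source's contribution is
        # then rescaled by the ratio nonlinear(total) / total so the parts still sum up.
        out = np.maximum(total, 0.0)
        return out, contribs * (out / (total + eps))

    layer = TaggedLinear(d_in=8, d_out=4)
    total, contribs = tagged_relu(*layer.forward(np.random.randn(8)))
    # contribs.sum(axis=0) ~= total: a per-source attribution of each activation

As noted above, storage scales with the number of tags, which is why only a handful of coarse categories would be practical.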


That sounds like a cool research project. However, I think it would be enough to simply (1) not erase authorship data, and (2) not fine-tune and monitor LLMs to suppress outputs that mention people. It's an emergent property that they can trace the provenance of text. For example, if I give it an often-repeated quote from a book written in the 1800s, then it can tell me where it came from. That's hundreds of years of noise it's weeding through. Imagine what language models could do for recently created knowledge.


> For example, if I give it an often-repeated quote from a book written in the 1800s, then it can tell me where it came from. That's hundreds of years of noise it's weeding through. Imagine what language models could do for recently created knowledge.

Alternatively: That's hundreds of years of mentions of that quote it can pull from.


When I look up an algorithm on Stack Overflow, I can also see the discussions around it, people poking holes in it, and successively corrected versions, so in the end I understand how and why it works and trust that it works. LLMs spit out something with no context, like crappy enterprise software that you're supposed to maintain. I hate them both, and while I can't refuse my crappy enterprise job, I can say no to more, a lot more, of it from AI.


When an LLM generates code, you don’t want it to generate author names! Maybe that’s the real reason it has to be stripped from the training data?

Describing living authors accurately is sensitive enough that I think it’s going to require human oversight via Wikipedia. Justine has a Wikipedia page, but it could use some expansion on the software side. A project to improve Wikipedia by crediting authors of notable software might be useful? And LLMs will pick it up from there.


> When an LLM generates code, you don’t want it to generate author names!

Who exactly is the "you" in that claim?

Justine clearly does want their author name (accurately) included.

I'd argue that anyone who's used a license with attribution requirements has explicitly stated they want that.

OpenAI et al. don't want that, for at least two reasons I can think of: 1) because their "artificial intelligence" isn't capable of doing it accurately or without "hallucinating" incorrect authorship attributions, and 2) because of the tsunami of copyright lawsuits that would immediately drown them.


I support Wikipedia. They do great work. AI systems can be complementary, because they've always been a raw source of information that requires your own research. People trust writers and bloggers to do that for them, and they trust Wikipedia to curate the consensus of books and blogs. So nothing would fundamentally change. The same people would simply be empowered with the next generation of tools. More precise and accurate ones that help them do their jobs better.


Yes, building a better system outside Wikipedia might be pretty great too.

LLMs pretend to be general purpose, but maybe optimizing for code autocomplete versus searching over a knowledge graph are two different things and might end up as different subsystems.

Or maybe they’re just different kinds of training data? Like, strip the author data from the code (for autocomplete) but then use it to generate author pages to train on.
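A hypothetical sketch of that split: strip attribution headers from code destined for autocomplete training, and fold the stripped names into a separate "author page" document to train on. The strip_authorship and author_page helpers, and the "// Author:" header convention, are invented for illustration:

    import re

    def strip_authorship(source: str) -> tuple[str, list[str]]:
        """Remove '// Author: ...' header comments, returning clean code and the names."""
        authors = re.findall(r"^//\s*Author:\s*(.+)$", source, flags=re.MULTILINE)
        code = re.sub(r"^//\s*Author:.*\n", "", source, flags=re.MULTILINE)
        return code, authors

    def author_page(repo: str, files: dict[str, str]) -> str:
        """Build a prose-style training document crediting authors per file."""
        lines = [f"Authors of {repo}:"]
        for path, src in files.items():
            _, authors = strip_authorship(src)
            if authors:
                lines.append(f"{path} was written by {', '.join(authors)}.")
        return "\n".join(lines)

The stripped code would feed the autocomplete corpus, and the generated author pages would feed the factual/attribution corpus.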


Ownership has no place in the noetic realm. Please, can we stop the encroachment of legal constructs beyond the physical world? It's too much. Authorship is like virginity: some people care about it way too much. Stop obsessing over it.


Even if you don't agree with intellectual property, adding provenance to LLMs is a problem worth solving. A solution is likely part of solving "hallucinations", and probably part of reasoning about truth in the world.


You're attacking a strawman. Authorship data like git commit history is about telling the story of how people brought a thing into existence. The journey can matter as much as the destination. Will AI be able to move forward on its own, if it has no knowledge of how it got there? I guess it's a question of how smart you really want it to be.


Intellectual property isn't a valid idea.


There are really no natural rights, except winning a contest. Eat or be eaten, these are the only natural rights. Every other right is, to put it simply, a product of civilized society. And civilized society thinks that remuneration for intellectual works is a good thing.

Copyright aims to apply free-market competition to authored works while protecting against those who wish to freeload. It's even better than that, because freeloaders have a vast array of free stuff available, and nobody would complain that you personally benefit from their use even if you don't contribute back.

But I do agree with you halfway: property rights are a social construct. I just disagree with the "invalid" part.


Great, so when are OpenAI and Anthropic going to make their LLMs public domain?


Then you must be thrilled with the way AI training is being practiced.


Yes, I enjoy the creation of new technologies which can make the legal fabrication of intellectual property terminally unfeasible.


They may have indemnity against rent-seeking but they can still be judged for their behavior.


Right, but I'm not advocating for actions within the current system. I'm advocating for a change to the current system by drawing a contrast between legality and reality.

We have a collective tendency to forget that we're the authors of legality and tend to reverse our causal arrows. Reality should be reflected in the law, not be shaped by the law.


Law and money are both old-world concepts. I care about code and attention. I also believe in respecting other people. Law was invented to formalize deeper concepts like respect. Just because the law is broken doesn't mean it's OK to start disrespecting others' boundaries and intentions. All progress is played out above the legal waterline. Code doesn't exist until it's written. That's the way it's always been. We also have a voice in that process.


If you are the sole creator of a piece of content, yes, you should be mentioned. But when 1000 others have written the same thing, like a ToDo app or a sorting algorithm, then it becomes impossible to name all the authors, or to disentangle their influences.

Technically it could be possible to have author metadata either prefixing or postfixing the content; that way the model learns to generate from an author's style or to predict the (likely) attribution of a snippet. Or they could create one BERT embedding model for author metadata and another for content, and train them with the CLIP method, so the model would learn to map text to author embeddings.
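A toy sketch of that CLIP-style pairing, with two small stand-in encoders (the Encoder class, vocabulary size, and dimensions are made up; in practice each encoder would be a BERT-like model):

    import torch
    import torch.nn.functional as F

    class Encoder(torch.nn.Module):
        """Stand-in for a BERT-like encoder producing one normalized vector per input."""
        def __init__(self, vocab=30000, dim=256):
            super().__init__()
            self.embed = torch.nn.EmbeddingBag(vocab, dim)  # mean-pools token embeddings
            self.proj = torch.nn.Linear(dim, dim)

        def forward(self, token_ids):                 # token_ids: (batch, seq_len)
            return F.normalize(self.proj(self.embed(token_ids)), dim=-1)

    author_enc, content_enc = Encoder(), Encoder()

    def clip_loss(author_ids, content_ids, temperature=0.07):
        a = author_enc(author_ids)       # author-metadata embeddings, (batch, dim)
        c = content_enc(content_ids)     # content embeddings, (batch, dim)
        logits = a @ c.T / temperature   # similarity of every author/text pair in the batch
        targets = torch.arange(len(a))   # the i-th author matches the i-th text
        # Symmetric cross-entropy, as in CLIP: author->text and text->author.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2

    # e.g. loss = clip_loss(torch.randint(0, 30000, (4, 16)), torch.randint(0, 30000, (4, 64)))

At inference time you would embed a snippet with content_enc and look up the nearest author embeddings to predict a likely attribution.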

There are tons of data with author and date attached - books, journals, papers, media articles, blog posts, social network posts, GitHub repos - and they all provide a way to index ideas to their authors and in time, so we could potentially find who first invented an idea and who expanded on it later.

The negative effects could include disclosing the authors of anonymous texts online, or disclosing one author's sources of influence even when they keep them secret, or don't even realize them themselves. It could be embarrassing to have your sources outed like that. You might find out your original ideas were invented elsewhere.


That's like saying Shakespeare shouldn't be credited for Romeo and Juliet because he's not the first guy to have written a love story and it's impossible to disentangle his ideas from those of Elizabethan culture. It's pure rubbish. When you're an open source volunteer who isn't being paid, credit is the only thing you can hope to gain, and to deprive people of that is just cruel. If AI can consider author data, then we might even learn the truth about provenance too. It could illuminate who the unsung heroes are. You won't have to evangelize your work like P.T. Barnum to get recognition.


Even further: I don't think there is any evidence Shakespeare wrote his works as we know them. After his death, some friends worked together to assemble his works.


The topic wasn't really about something like "where did this bubble sort come from" but about bulk removal, apparently before training, such that:

> .. if I ask something like Claude, "what sort of code has Justine Tunney wrote?" it hasn't got the faintest idea. Instead it thinks I'm a political activist, since it feels no guilt remembering that I attended a protest on Wall Street 13 years ago.


Note that the above article is not just about attribution but also about inclusion of licenses with the copies of licensed code, something that, as far as I can tell, no LLM ever does.

Is this because the code is so mashed together that it's impossible to say which bit was copied from which original source? Well, then, that's a very big problem, and if a human did this we'd rightly label it plagiarism.



