
intellectual property scanning

With "normal" code I can generally see (or figure out) who posted/published it and reach out for explicit permission. It's not uncommon for me to do this.

How is one supposed to do that for the generated stuff? Seems like an awfully hands-off attitude. As challenging as it is, they really ought to be qualifying the input samples of training code before ingesting them.



There are some techniques used mostly to detect when students copy-paste code. I've seen some of the tools in that space and they have varying degrees of accuracy. MOSS is a common one[0].
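For anyone curious what this kind of detection looks like, here is a rough sketch of winnowing-style fingerprinting, the published idea that MOSS builds on. This is not MOSS's actual code: the function names, the whitespace-only normalization, and the `k`/`w` defaults are all my own assumptions for illustration, and a real tool would tokenize the source instead of just stripping whitespace.

```python
import hashlib
import re


def normalize(code: str) -> str:
    # Crude normalization (an assumption for this sketch): lowercase and
    # drop all whitespace so formatting changes do not affect the result.
    return re.sub(r"\s+", "", code.lower())


def kgram_hashes(text: str, k: int) -> list[int]:
    # Hash every overlapping substring of length k.
    return [
        int.from_bytes(
            hashlib.blake2b(text[i:i + k].encode(), digest_size=8).digest(),
            "big",
        )
        for i in range(len(text) - k + 1)
    ]


def winnow(hashes: list[int], w: int) -> set[tuple[int, int]]:
    # Slide a window of w hashes and keep the minimum from each window
    # (rightmost on ties), recorded together with its position.
    selected = set()
    for i in range(len(hashes) - w + 1):
        window = hashes[i:i + w]
        smallest = min(window)
        offset = w - 1 - window[::-1].index(smallest)
        selected.add((smallest, i + offset))
    return selected


def fingerprint(code: str, k: int = 5, w: int = 4) -> set[int]:
    hashes = kgram_hashes(normalize(code), k)
    if len(hashes) < w:
        # Input too short for a full window: keep every hash.
        return set(hashes)
    return {h for h, _ in winnow(hashes, w)}


def similarity(a: str, b: str, k: int = 5, w: int = 4) -> float:
    # Jaccard overlap of the two fingerprint sets, in [0.0, 1.0].
    fa, fb = fingerprint(a, k, w), fingerprint(b, k, w)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Calling `similarity(snippet, candidate)` gives a score between 0.0 and 1.0. Matching a generated snippet against a large corpus would need an index from fingerprint hashes to source files, which this sketch leaves out.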

There are some vendors in this space too (BlackDuck comes to mind) but they're $$$, so they're only within reach of large corporations.

If anybody has any ideas relating to this type of analysis, I'd be excited to chat. I am working on a project[1] in this space for "Software Composition Analysis" which could potentially overlap with snippet detection for code from tools like Copilot. (We basically just have a big pipeline of analysis jobs that run on code and store the results. I need to update the docs!)

0: https://yangdanny97.github.io/blog/2019/05/03/MOSS

1: https://github.com/lunasec-io/lunasec/tree/master/lunatrace


I don't think it's right to characterize it as hands-off after they had their hands all up in the generated code. It's just malfeasance. They've produced a tool that is fundamentally (legally) unsafe to use and said that's not their problem.



