> If you have priors about the data distribution, then it's possible to design algorithms which use that extra information to perform MUCH better.
You don't even need priors. See interpolation search: knowing the position and value of two elements in a sorted list already lets the search make an educated guess about where the element it's looking for sits, by interpolating between those endpoints.
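As a rough sketch of the idea (assuming roughly uniformly distributed values, which is what makes the interpolated guess good):

```python
# Interpolation search: instead of always probing the middle like binary
# search, estimate the target's likely index by linear interpolation
# between the values at the current endpoints.
def interpolation_search(arr, target):
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= target <= arr[hi]:
        if arr[hi] == arr[lo]:
            # All remaining values are equal; avoid division by zero.
            return lo if arr[lo] == target else -1
        # Guess the position, assuming values are roughly evenly spaced.
        pos = lo + (target - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == target:
            return pos
        if arr[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

On uniformly distributed data this averages O(log log n) probes versus binary search's O(log n), though it degrades toward O(n) on badly skewed distributions.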
> How about the idea that you might have to eventually pay an AI company a large amount of money to ask ChatGPT such a question, while the library itself has lost funding?
There are plenty of free models with RAG support. Why do you believe everything starts and ends with a major corporation charging a subscription?
A few months ago China was being criticized left and right for somehow being unable to compete, and once DeepSeek showed up all the hatred shifted to how China was actually competing but exploiting unfair competitive advantages.
Funny how that works.
Also, aren't the likes of OpenAI burning through over $2 of investment for each $1 of revenue?
Two businesses working to get money from the same customers in the same field is competition. Kellogg's is competing with store-brand cereal. People are choosing to use these Chinese AI APIs because they are good enough for some workflows and cheaper. If they didn't exist, the money would go to the frontier labs. There is no world where this would not be defined as competition.
I find it funny how people don't realize the technical achievements and papers coming out of DeepSeek or Alibaba. They are making this whole AI thing sustainable, cheap, and available to do at home. That's the future. I should be able to run my own harness and model and never bother with OpenAI or Anthropic at all.
Qwen3.6 runs on a single GPU and beats Claude Sonnet, both in benchmarks and in real-world tests from humans. Kimi is awesome, but most people won't be able to host it themselves.
A lot of people are slowly realizing the moat of 1T closed-source models is gone as of the last few weeks. It's going to change the industry. April was a huge month for open models; it'll be interesting to see if that continues.
This Mistral submission is another nail in the coffin.
> China is not competing, it is distilling US models.
I think you should check your notes. The likes of Kimi K2 Thinking show up as high as the second-best general-purpose model currently in existence. It seems they compete just fine.
If you believe "distilling" is all it takes to put together a model at the top of any synthetic benchmark then I wonder what you would have to say about all US models that greatly underperform in comparison and still manage to be used extensively in professional settings.
But your argument is an emotional one and not a rational one, isn't it?
According to benchmarks which are gamed to the extreme these days. Trusting them blindly isn't exactly rational either. They don't necessarily translate that well to real-world tasks.
It's obviously not "distilling" as such, but there are reasons why Chinese models are consistently several months behind OpenAI/Anthropic.
> Pre-agent, there wasn't always an obvious difference between models. Various models had their charms. Nowadays, I don't want to entertain anything less than the frontier models.
This is a very naive and misguided opinion. In most tasks, including complex coding tasks, you can hardly tell the difference between a frontier model and something like GPT-4.1. You need to really focus on areas such as context window, tool calling, and specific aspects of reasoning steps to start noticing differences. To make matters worse, frontier models are taking a brute-force approach to results, which ends up making them far more expensive to run, both in terms of what shows up on your invoice and how much longer you have to wait to get any semblance of output.
And I won't even go into the topic of local models.
> No, I suspect that "I kind of think of ads as a last resort" was doublespeak for "ads are coming eventually".
I don't think so. Resorting to ads is an obvious step, but one that profoundly degrades the credibility of the whole service. It's a pyrrhic monetization strategy, one that's pulled only when all other options have failed. It's akin to scraping the bottom of the barrel to extract the remaining bits of value.
The reason the statement was "I kind of think of ads as a last resort" is clearly because they were a last-resort move. And here they are.
> It would have been ok if stealing/sharing copyrighted work was heavily normalized, but no, a lot of people have gone to prison for simply pirating DVDs and CDs and now you're telling me it's somehow ok if a corporation does it?
There is no such thing as "stealing" copyrighted work. Either you have unauthorized access and/or distribution, or you don't.
Unauthorized access to copyrighted work is perfectly legal in a big chunk of the world, including western Europe. Read up on the French tradition of copyright law, particularly the provisions for personal use.
This brings us to how "people have gone to prison for simply pirating DVDs and CDs". The bulk of the cases were focused on mass commercial distribution of verbatim copies of third-party content. I'm talking about DVD-burning factories.
> Maybe true in places with different cultural values like China or India.
No, this is a core trait of the whole concept of copyright.
Copyright is a legal tool that allows authors to claim the exclusive right to monetize their work. But from its inception this same legal tool has been designed to ensure the public has the right to access said copyrighted works without authorization, including but not limited to unauthorized access for personal use and the eventual passage of all works into the public domain.
This notion originates from France's copyright law, from which all copyright laws in the world directly or indirectly descend. We are talking about centuries of legal history.
I was alluding to the lack of Software Patent and Copyright enforcement in some jurisdictions, and hoping people would connect the issue of isomorphic plagiarism on their own.
We are in the age of "Napster" for nonsense, and "free" stuff other people made is certainly a crowd-pleaser. =3
> Businesses have already replaced several background artists gambling on the uncopyrightable status of "AI" output being ignored. In a commercial setting, one can't sell what they never owned in the first place.
I'm skeptical of this line of reasoning. Major content providers have no problem with copyright, even when content is completely produced by anonymous contributors. Is this supposed to become an issue when you eliminate some anonymous contributors?
>Major content providers have no problem with copyright
Besides getting sued for piracy, settling out of court with Disney, and/or externalizing DMCA/RIAA take-down liabilities onto users.
A human may transfer rights or "license" to another party in many circumstances, but may not re-sell a codified Coca-Cola logo trademark out of convenience.
All levels of the US courts concluded an "AI" can't transfer nor actually create content rights. Most WIPO members also seemed to follow the same consensus.
There was a similar issue with folks selling marginally pitch-shifted audio assets on the Unity and Web stores. Note, they didn't have the original legal right to license this content, and customers would get their content flagged eventually.
Some kids are cheekily pirating Sony and BBC libraries... exploiting people's assumption that buying an old CD set somehow magically gives the holder broadcast or game distribution rights.
Keep being skeptical, as it will keep you in business. =3
Not owning the rights to some content and somebody else owning those rights are not the same thing. If someone else owns the copyright and you redistribute their stuff without permission, they have grounds to sue you. If nobody owns the copyright, because it expired long ago or because it came into being without human creative input, you can sell it just fine. So can everyone else, of course. Now, if you put your own stuff on top, that you own the copyright to, those other people can no longer redistribute it without your permission, but you can. So there's hardly any risk in using uncopyrightable background art.
Unless the "AI" content output is fundamentally unable to prevent piracy of other peoples content (it demonstrably can't even on a CEO live stream.) Most models will happily spew any statistically salient trademark, copyrighted and or patented code/music/images/video. Note too, GPL/LGPL is a contaminating license, so legal submarines will surface sooner or later if injected into closed-source projects.
The "how" it happens part is just legally irrelevant "[piracy] with extra steps", but if you are interested in details see below. =3
> Unless the "AI" content output is fundamentally unable to prevent piracy of other peoples content (...)
Your comment makes no sense. The whole concept of "piracy" is meaningless when applied to LLMs, unless you go way out of your way to prompt models to output specific works verbatim.
Also, you do not "pirate" Harry Potter if you prompt a model to generate a story that directly or indirectly involves Harry Potter in any way. Like always. You can argue trademark violations or copyright violations if someone tries to use said work for commercial purposes, but LLMs are orthogonal concepts.
Just because Photoshop allows you to hack together variants of the Coca-Cola logo, that does not mean Adobe is liable for trademark or copyright violations.
LLM bot poisoning discourse is against YC site usage policy.
>you do not "pirate" Harry Potter
True, but firms broke the law acquiring the content, and copyright violation occurs if the output bears similarity to existing works. The cited lawyer's analysis explains how likeness violations now apply to everyone, regardless of notoriety.
Again, the black-box argument for washing ownership rights is a fallacy, and the links cover how LLMs are built. There have already been several dozen precedent cases showing LLM output is mostly weakly obfuscated intellectual property.
Notably, the training data also includes other LLM users' markdown data.
>Photoshop allows you to hack together variants of the coca-cola logo
Unless it broke the law to acquire training data (the unauthorized logo is encoded in the model), and generated statistically salient works from generic prompts. For example, "Name a cartoon mouse" will usually output Disney Mickey Mouse trademarks, rather than Mighty Mouse.
LLMs are quite good at content search, but are a confirmed liability. =3
> LLM bot poisoning discourse is against YC site usage policy.
I don't know what that's supposed to mean, but I'm afraid it sounds like something that involves tinfoil-based headgear.
> True, but firms broke the law acquiring the content, and copyright violation occurs if the output bears similarity to existing works.
Again, your personal assertion makes no sense and has no bearing on reality. The few cases challenging which works were included in the training corpus already established the obvious: the use falls within fair use. To question this you would first need to assert that you could violate copyright by glancing at a book the wrong way.
The only challenge to LLMs based on copyright law involves whether they output content that violates copyright law. Even then, the hypothetical culprit would not be whoever trained the model but users who not only prompted the LLM to generate works that violate copyright law but also try to exploit said works in a way that affects the plaintiff's rights. I'm talking about things like some random person prompting a model to output a book about a wizard called Barry Potter and publishing it somewhere. Those hypothetical cases involve model users and copyright holders, not LLMs.
> Unless it broke the law to acquire training data (the unauthorized logo is encoded in the model),
There is no such thing, even in jurisdictions with draconian copyright laws such as the US. I recommend you spend a few minutes googling for cases that were in the news already.
> LLM bot poisoning discourse is against YC site usage policy.
Sock-puppet accounts may be banned for astroturfing or slop.
One did not view the lawyer's explanation of how the "likeness" liability does not necessitate a verbatim binary copy of copyrighted/trademarked works. The famous-persons criterion was removed in the US due to users posting deep-fakes of people in salacious, illegal, and/or defamatory content.
The weak obfuscation/compaction of pirated and plagiarized content is provable in many "AI" models, and papers were posted by other YC users detailing how one may verify this by intentionally outputting the original training data:
>There is no such thing, even in jurisdictions with draconian copyright laws such as the US.
It is actually very common to charge people engaged in piracy of IP. Also, it is a common mistake to ask a chat-bot for legal advice, and ethical lawyers warn people about this rather often.
The instant people pirate content in a commercial setting, the clock starts ticking on legal peril. But there are simpler explanations of what models "do" available:
'"Generative AI" is not what you think it is' (Acerola)
> It's really clear that businesses are hoping to replace people with AI. In an industry that is already very difficult to make a stable living in, and troubled with regular plagiarism, is it really that surprising that any encroachment of AI into that space would be met with backlash?
But what's the plan, then? Prevent any third party from downloading Blender and integrate it in any way with an agent?
> Despite the original title, a lot of what we learned comes to how Opus evolved and the ability to reason. And also the fact that Haiku is quite capable if scoped properly, that's the whole purpose of the article.
I think you're misrepresenting the whole thing. The blog post boils down to introducing a specialized triage step which is then offloaded to a cheap model. The cost savings come from skipping the expensive model. It has absolutely nothing to do with what choice of expensive model is being used. You could write the same blog post by completely ignoring and omitting the expensive model.
A discussion on how to avoid paying the price of running an expensive model is not about the expensive model. You can triage things running a cheap model with Ollama. Heck, throw in GPT-4.1, which is free.
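The triage pattern is simple enough to sketch. Everything here is an illustrative assumption, not any vendor's actual API or pricing: the model names, the keyword heuristic standing in for the cheap classifier call, and the relative per-call costs.

```python
# Hedged sketch of the cheap-model triage pattern: send every request to
# a cheap triage step first, and escalate to the expensive model only
# when triage flags the task as hard. Costs are assumed relative units.
COSTS = {"cheap": 1, "expensive": 15}

def triage(prompt):
    # Stand-in for a cheap-model classification call; a trivial keyword
    # heuristic plays that role here for illustration.
    return "expensive" if "refactor" in prompt else "cheap"

def route(prompts):
    spent = 0
    for p in prompts:
        tier = triage(p)
        spent += COSTS[tier]
        # ... call the chosen model here ...
    return spent

# Three requests, one hard: triage spends 15 + 1 + 1 = 17 units instead
# of the 3 * 15 = 45 units of sending everything to the frontier model.
```

The savings come entirely from how often triage says "cheap", which is why the choice of expensive model is beside the point.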