Claude code and cursor agent and all the coding agents can and do run MCP just f...

TZubiri · 2026-02-01T19:18:26 1769973506

Here's the specific source on this matter:

https://youtu.be/VsTbgYawoVc?si=6ZE83umppNCz9h-a&t=1021

"The models today, they are tuned to call specific tools. We've played with a lot of tools, you can hand it a bunch of tools it's never seen before and it just doesn't call them. There's something to the post-training process being catered to certain sets of tools. So anthropic, cloud 4, cloud3.7 before that, those models are the best at calling tools from a programming standpoint. They'll actually keep trying and going for it. Other models like Gemini2.5 can be really good, but it doesn't really call tools very eagerly. So we are in the state right now where we kind of have to provide the set of tools that the model expects.

I don't think that'll always be the case, but we've given it a bunch of LSP tools. I've played with giving, say, giving it 'go to definition' and 'find references', and it just doesn't use them. I mean you can get it to use them if you ask it to, but it doesn't default to kind of thinking that way, I think that'll change."

He then goes on to theorize it's the System Prompt, so open models like Llama where you can customize the system prompt might have an advantage. (I think API models still have a prebaked prompt, not sure). Additionally, even when you control the prompt, he argues there's a (soft) limit to the amount of tools it can handle.

Personally, I think a common error with LLMs is conflating what is technically possible and what works in practice. In this case the argument is that custom tools and MCPs are possible, but it's limited in the sense that "you often need to explicitly tell them to use such tool, you can only have a small amount of custom tools", when you compare it to system prompt specified tools, and tools in the training set which are fine tuned to, it's a whole different category, the native tools are just capable of way much more autonomy.

A similar error I've seen is conflating the context length to the capacity to remember. That a model has a 1M token window means that it could remember something, but it would be a categorical mistake to claim or depend on the model remembering stuff in a 1M token conversation.

There's a lot of nuance in these discussions.

oofbey · 2026-02-01T21:34:36 1769981676

Another common mistake today is to observe one LLM failing to do something in a single situation, and to generalize that observation to "LLM's are incapable of doing this thing." Or "they're not good at this kind of thing" which is what you're repeating here. This logic underlies a lot of AI skepticism. Sure you and they aren't skeptics and acknowledge this will get better. But I think you're over-indexing on a specific problem they observed. Plus to blame the LLM when they haven't optimized the system prompt is IMHO quite silly - it's kind of like "did you read the instructions you were giving it?". What I think they should say is "I tried this and it didn't work super well out of the box. I'm sure there's some way to fix it, but I haven't found it yet." Instead of blaming the model intrinsically.

In contrast, I've seen coding agents figure out extremely complex systems problems that are clearly outside of their training set - by using tools and interacting with complex environments, and reasoning and figuring it out.

Plus, "tools" can be multi-layered. You give an agent a "bash" tool, and voila, it has access to every piece of software ever written. So I don't think any of these arguments apply in the slightest to the question of de-compiling code.

TZubiri · 2026-02-02T02:27:10 1769999230

>"What’s the state of the art of reverse engineering source code from binaries in the age of agentic coding?"

This is the original comment I was responding to, we were talking about state of the art Agentic models. So not generalizing to other scenarios.

>Sure you and they aren't skeptics and acknowledge this will get better

I think this is a common bipartisan trap where you lose a lot of nuance. And it's imprecise, you don't know whether I'm a skeptic or not. It's like reading a nuanced opinion and trying to see if they are Republican so you can agree or Democrats so you can disagree.

>Plus to blame the LLM when they haven't optimized the system prompt is IMHO quite silly - it's kind of like "did you read the instructions you were giving it?"

I think the context here is that when using agentic tools like Claude Code, you don't control the system prompt. You could write your own prompts and use naked API calls, but that's always more expensive, (because it's subsidized), and I'm not sure what the quality of that is.

The bottom line is that API calls, where you can fully control the system prompt, are more expensive. And using your OpenAI/Anthropic subscription has a fixed cost. So in that context they don't control the system prompt.

Even in cases where you could control the system prompt and use the API, there's the fact that some models (the state of the art) are fine tuned for specific tool use, so they have a bias towards fine tuned tool use. The claim is not that they are "incapable of doing X thing" it's that it's a bias towards the usage that was known at train-time or fine-tune time, instead of at inference which is much weaker. Nuance.

>Instead of blaming the model intrinsically.

Again, not blaming or being a skeptic here, just analyzing the state of the art and its current weakness, it's likely that these things are going to be improved in the next generation, this is going to move fast, if you conflate any criticism of the tools with "skepticism" you are going to miss the nuance.

>In contrast, I've seen coding agents figure out extremely complex systems problems that are clearly outside of their training set - by using tools and interacting with complex environments, and reasoning and figuring it out.

Yeah for sure, I'll give you a concrete example on this point where we agree. I made a model download a webdriver for a browser and taught it to use the webdriver to open the site, take screenshots, and evaluate how it looks visually, in addition to actually clicking buttons and navigating it. This is a great improvement when the traditional approach is just to generate frontend code and trust that it works (which to be fair, sometimes works great, but you know, it's better if it see that.)

And it works, until it doesn't and I have to remind it that it can do that. It's just a bias. If they would have trained the model with WebDriver tool access, the model would use it much more (and perhaps they are already doing that and we will see it in the next model.)

The main thesis is that instructions taught at train time 'work better' than at fine-tune time which in turn are stronger than 'inference'. To be very specific during inference tool use is much more likely immediately after mentioning it, it might be stronger more consistently at the system prompt, (but it competes with other system prompt instructions and it's still inference based). To say nothing of the costs associated with adding to inference tokens, compared to essentially free training/finetune biases. I don't think anyone disagrees that stuff you teach the model during training has better quality and less cost than stuff you teach at inference.

I think playing around with logit biases is an underrated tool to increase and control frequency of certain tools, but it doesn't seem that's being used much in this generation of vibecode tools, the interface is almost entirely textual (with some /commands starting to surface). Maybe the next generation will have the option to configure some specific parameters instead of entirely relying on textual prompting.