I've heard it constantly over the past two years: web development is apparently dead now, because anyone can just slap together a website or build a dashboard in minutes with LLMs.
But that has been true for decades: templates, component libraries, and so on.
But I guess it's now easier to adjust them if you're not familiar with their configuration.
When I was younger this actually irked me a bit. I wasn't familiar with either, so it felt burdensome to me. The tooling also wasn't as good as it is now.
However, there's no doubt that this is one of the primary reasons Clojure became relevant and widely used (for a niche language). Seamless interop with the host platform, the JVM in Clojure's case, is very useful.
Another language that takes this approach is Zig, with its first-class C interop. My intuition is that here as well, it's a unique selling point that will help with adoption and day-to-day usefulness.
PHP's performance can be significantly lower than JS, because it doesn't have application state (in a standard runtime/setup) and needs to re-run the entire application for every request. Now there are a whole bunch of tricks both in the language and with tooling to alleviate that, but still it's inherently there. It's an advantage for other reasons though.
There are advantages to the lack of application state, though. Memory leaks and similar bugs become largely irrelevant, for instance. Regarding performance, a simple LAMP stack on a dedicated machine can easily give you sub-250ms page loads for many web apps. If that's not fast enough, or you're averaging dozens or hundreds of requests per second, you're probably big enough that you can use parallelization or more exotic architectures to speed things up.
> PHP's performance can be significantly lower than JS, because it doesn't have application state (in a standard runtime/setup) and needs to re-run the entire application for every request.
Sure, in PHP, the reality is that after your request is processed, all the state is garbage and is thrown out. But once you embrace that reality and stop trying to make sculpture from garbage, you can make some pretty damn fast pages that get straight to the point. Of course, a lot of people look at my fast PHP and say that it too is garbage, but at least it's fast garbage :P
> because it doesn't have application state (in a standard runtime/setup) and needs to re-run the entire application for every request.
Where "application" is basically a single page with less code than a typical React page. Even 20 years ago, you'd run into the DB struggling to serve data fast enough long before you hit any issues with "re-running the entire app".
And you have to screw up your database really badly to see any issues early. Hell, phpBB was horrendously bad, running dozens of heavy DB queries on each page, and was still powering some of the internet's busiest forums.
> Now there are a whole bunch of tricks both in the language and with tooling to alleviate that, but still it's inherently there. It's an advantage for other reasons though.
Yes. It is an enormous advantage: it's fire and forget. You don't need to "SSR" your app (getting all data and state), ship it to the client with a bundle, then "re-hydrate" it (once again pulling data and state from the server) etc.
That ceased to be true a while ago, since the ecosystem has gravitated towards FrankenPHP, a stateful application server written in Go as a Caddy module. The performance is amazing, the Go bridge allows easy extension, and it's rock solid.
Don't look at "thinking" tokens. LLMs sometimes produce thinking tokens that are only vaguely related to the task if at all, then do the correct thing anyways.
Why does this comment appear every time someone complains about CoT becoming more and more inaccessible with Claude?
I have entire processes built on top of summaries of CoT. They provide tremendous value, and no, I don't care if "the model still did the correct thing". Thinking blocks show me if the model is confused; they show me what alternative paths existed.
Besides, "correct thing" has a lot of meanings, and a decision by the model may be correct relative to the context it's in but completely wrong relative to what I intended.
The proof that thinking tokens are indeed useful is that Anthropic tries to hide them. If they were useless, why would they even try all of this?
Didn't you notice that the stream is incoherent and noisy? Sometimes it goes from thought A to thought B and then action C, but A was entirely unnecessary noise that had nothing to do with B and C. I also sometimes saw signals in the thinking output that were red flags, or, as you said, the model got confused, but then it didn't matter at all. Now I just never look at the thinking tokens anymore, because I got bamboozled too often.
Perhaps when you summarize it, then you might miss some of these or you're doing things differently otherwise.
The usefulness of thinking tokens in my case might come down to the conditions I have claude working in.
I primarily use Claude for Rust, with what I call a masochistic lint config. Compiler and lint errors almost always trigger extended thinking when adaptive thinking is on, and that's where these tokens become a goldmine. They reveal whether the model actually considered the right way to fix the issue. Sometimes it recognizes that ownership needs to be refactored. Sometimes it identifies that the real problem lives in a crate that is, for some reason, "out of scope" even though it's right there in the workspace, and then concludes with something like "the pragmatic fix is to just duplicate it here for now."
So yes, the resulting code works, and by some definition the model did the correct thing. But to me, "correct" doesn't just mean working, it means maintainable. And on that question, the thinking tokens are almost never wrong or useless. Claude gets things done, but it's extremely "lazy".
Also, for anyone using Opus with Claude Code: they've once again "broken" the thinking summaries, even if you had "showThinkingSummaries": true in your settings.json [1]
You have to pass `--thinking-display summarized` flag explicitly.
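For reference, here is what the settings entry quoted above looks like in settings.json; the key name is taken from the comment here, not verified against current docs:

```json
{
  "showThinkingSummaries": true
}
```

Per the comments above, this setting alone no longer suffices; you also have to launch with `--thinking-display summarized`.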
I agree. Ever since the release of R1, it's like every single American AI company has realized that they actually do not want to show CoT, and then separately that they cannot actually run CoT models profitably. Ever since then, we've seen everyone implement a very bad dynamic-reasoning system that makes you feel like an ass for even daring to ask the model for more than 12 tokens of thought.
Thinking summaries might not be useful for revealing the model's actual intentions, but I find that they can be helpful in signalling to me when I have left certain things underspecified in the prompt, so that I can stop and clarify.
They also sometimes flag stuff in their reasoning and then think themselves out of mentioning it in the response, when it would actually have been a very welcome flag.
This can result in some funny interactions. I don't know if Claude will say anything, but I've had some models act "surprised" when I commented on something in their thinking, or even deny saying anything about it until I insisted that I can see their reasoning output.
Thinking helps the models arrive at the correct answer more consistently. However, they get the reward at the end of a cycle. Turns out, without heavy constraints during training, the thinking (the series of thinking tokens) is gibberish to humans.
I wonder if they decided that the gibberish is better, and that the thinking is interesting for humans to watch but overall not very useful.
OK so you're saying the gibberish is a feature and not a bug so to speak? So the thinking output can be understood as coughing and mumbling noises that help the model get into the right paths?
Here is a 3blue1brown short about the relationships between words in a high-dimensional vector space. [0] To show this conceptually to a human, it requires reducing the dimensions from 10,000 or 20,000 down to 3.
To make the thinking human-understandable, researchers reward not just the correct answer at the end during training, but also seed the beginning with structured thinking-token chains and reward the format of the thinking output.
The thinking tokens do just a handful of things: verification, backtracking, scratchpad or state management (like doing multiplication on paper instead of in your head), decomposition (breaking the problem into smaller parts, which is most of what I see thinking output do), and self-criticism.
An example: a math problem solved by an Italian and another by a German might cause those geographic areas to be associated with the solution in the 20,000 dimensions. So if the model gets more accurate answers in training by mentioning them, they will show up in the gibberish, unless the model has been trained to produce much more sensible (like the 3 dimensions) human-readable output instead.
It has been observed that, sometimes, a model will write perfectly normal-looking English sentences that secretly contain hidden codes for itself in the way the words are spaced or chosen.
> It has been observed that, sometimes, a model will write perfectly normal-looking English sentences that secretly contain hidden codes for itself in the way the words are spaced or chosen.
This sounds very interesting, do you have any references?
No, he's saying that in amongst whatever else is there, you can often see how you could refine your prompt to guide it better in the first place, helping it to avoid bad thinking threads to begin with.
It's certainly interesting that they provide an email service now. In their documentation and blog recommendations, they have already switched their recommended approach two or three times.
If they establish a solid email solution I will likely use that for some of the projects I'm hosting there.
No - this model has the weights memory footprint of a 35B model (you do save a little bit on the KV cache, which will be smaller than the total size suggests). The lower number of active parameters gives you faster inference, including lower memory bandwidth utilization, which makes it viable to offload the weights for the experts onto slower memory. On a Mac, with unified memory, this doesn't really help you. (Unless you want to offload to nonvolatile storage, but it would still be painfully slow.)
All that said, you could probably squeeze it onto a 36GB Mac. A lot of people run models of this size on 24GB GPUs, at 4-5 bits per weight quantization, perhaps with reduced context size.
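A back-of-the-envelope sketch of that arithmetic (weights only; KV cache and runtime overhead are ignored, and the 35B figure is the total parameter count discussed above):

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory footprint in decimal GB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# At 4.5 bits per weight, a 35B model needs roughly 20 GB for weights,
# which is why it fits on a 24GB GPU with some room left for context.
print(f"{weights_gb(35, 4.5):.1f} GB")  # ~19.7 GB

# At fp16 (16 bits per weight) it balloons to ~70 GB, far beyond
# a 24-36GB machine.
print(f"{weights_gb(35, 16):.1f} GB")   # ~70.0 GB
```

The active-parameter count of an MoE model changes compute and bandwidth per token, not this weight-storage figure, which is the point made above.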