Hacker News | kgeist's comments

Maybe running additional inference on all sessions to detect OpenClaw usage would cost more than the detection would save in the first place (which is the original goal). I also suspect the Claude Code team is just a regular software team without immediate access to ML pipelines (or the competence to run them) that would let them quickly build a proper abuse-detection system with extensive testing (to avoid false positives, which people would also complain about). They're under pressure from management to do something right now, so a regex is all they can do within those constraints.

The benchmark is strange: it reports single-run results (the author acknowledges this is unreliable) and uses older models like GPT-4o and Opus 4 (even though the benchmark is from 2026).

>The short answer is that variable names are one of the things that confuses LLMs rather than helps them. Unlike with humans, names undermine a model's efforts to keep track of state over larger scales. Models confuse similarly named variables in different parts of the codebase easily

So I wonder, doesn't this apply to function names too, which the author keeps in? I've seen LLMs use wrong functions/classes as well.

I think a proper harness, an LSP, and tests already solve everything Vera is trying to solve. They mostly cite research from 2021, before coding harnesses and agentic loops were a thing, back when people were basically trying to one-shot with relatively weak models (by modern standards).


The only way the author could have come up with that rationale is if he doesn't understand what a token is, what attention is, or how coding agents work.

Tokens combine multiple characters into a single vector, and attention computes similarity scores between vectors. That means you'd want each variable to be a single token, so the LLM can instantly tell that two names refer to the same variable. If everything is numbered, the attention mechanism will attend every first parameter to every first parameter in every function, so the numbering scheme would have to be randomized instead of starting at zero.
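To make the attention argument concrete, here's a toy sketch in pure Python (made-up 2-dimensional embeddings; real models use learned high-dimensional ones): when two occurrences of a numbered name share the same token embedding, attention assigns them identical scores and can't distinguish the unrelated variables.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_scores(query, keys):
    """Softmax over query-key dot products (toy single-head attention, no scaling)."""
    sims = [dot(query, k) for k in keys]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

# Made-up embeddings: both occurrences of the numbered name "v0" map to the
# same token, hence the same vector; a descriptive name gets its own vector.
v0 = [1.0, 0.0]          # "v0" in function A *and* "v0" in function B
user_count = [0.9, 0.4]  # a distinct descriptive name

# Querying for "v0": the two unrelated "v0" keys get identical weights,
# so attention alone cannot tell them apart.
scores = attention_scores(v0, [v0, v0, user_count])
print(scores[0] == scores[1])  # True
```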

Coding agents are now capable of using tools, including text search, which means the ability to look for specific variable names is extremely helpful. By using numbering, the language's author has burdened himself with relying entirely on LSPs rather than on innate model properties that operate at the text level.

So yeah, on a textual level, the language is designed for an era of LLMs that has been obsolete for a long time.


We're planning to do the same thing: buy something like 8xH100 and run all coding there. The CTO has almost agreed to find the budget for it, but I need to make sure there are no risks before we buy (i.e., that it's a viable/usable setup for professional AI-assisted coding).

Can you share which models you run and find best-performing for this setup? That would help a lot. I already run a smaller AI server in the office, but only 32b models fit there. I already have experience optimizing inference; I'm just interested in which models you think are great for 8xH100 coding, and I'll figure out the details of how to fit them :)


8 x H100 80GB doesn't give you enough memory to run the latest 1T+ parameter models (especially at the context-window lengths needed to be competitive with the frontier models).
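A back-of-the-envelope check (assumed sizes: fp8 weights, ignoring KV cache and activations, which only make it worse) shows why the weights alone don't fit:

```python
# Assumed numbers: a ~1T-parameter model with fp8 (1 byte/param) weights.
params = 1.0e12
bytes_per_param = 1
weights_gb = params * bytes_per_param / 1e9   # 1000 GB of weights

gpus, vram_per_gpu_gb = 8, 80                 # 8 x H100 80GB
total_vram_gb = gpus * vram_per_gpu_gb        # 640 GB

# Weights alone overflow the cluster before any context is loaded;
# the KV cache then grows with context length and batch size.
print(weights_gb > total_vram_gb)  # True
```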

Verda has B300 clusters: 8 GPUs for USD $55/hour, billed in 10-minute blocks.

DeepSeek, GLM, MiniMax, or Kimi are the most likely contenders.

I've been using Kimi 2.5/2.6 for the past 2 weeks and it's really not far off the OpenAI and Claude models. I'm a coder, so it's not all vibes, but I'm definitely more in "spec to code" mode than "edit this file for me" mode, and it copes just fine. It needs a bit more supervision than the frontier models, but it's also significantly cheaper. If I were Anthropic I'd be shitting myself; their prices are going to 10x over the next 2 years.

So are you running Kimi on Verda?

Check out Verda you can rent whatever super powerful GPU clusters you need in 10 minute increments. Deploy any open weight model using SGLang and away you go

How do you validate that the reports are correct? What if an executive makes a wrong business decision because the LLM wrote a wrong SQL query?


> What if an executive makes a wrong business decision

I jokingly tell students, "We all know executives are gonna make bad decisions no matter what the data says. Might as well give them the random numbers more quickly."


The same way we've always done it: glance at it and see whether the numbers are within an order of magnitude of what seems reasonable.
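That glance can be written down as a trivial check. This is just an illustration of the idea, with a hypothetical helper name and a one-order-of-magnitude tolerance:

```python
import math

def within_order_of_magnitude(value, expected, tolerance=1.0):
    """Flag values more than ~one order of magnitude off a rough expectation.

    `tolerance` is in decades: 1.0 means 10x in either direction.
    """
    if value <= 0 or expected <= 0:
        return False  # sign flips and zeros are suspicious on their own
    return abs(math.log10(value / expected)) <= tolerance

print(within_order_of_magnitude(9500, 3000))    # True  (plausible)
print(within_order_of_magnitude(95000, 3000))   # False (an order too high)
```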

So what if some numbers in the report are actually an order of magnitude or two outside what you think is reasonable, because something went wrong, but the AI agent reports something that looks normal?

So as long as the LLM only makes errors in the single-digit percentage range, everything is peachy. Make number go up, but not by too much.

If you already know the report's numbers, why are you asking an LLM to generate it?

Usually because you need something vaguely technical and authoritative-sounding to push for a decision you've already made.

>Stash makes your AI remember you. Every session. Forever.

How does it fight context pollution?


Custom constrained decoding could have solved this. Penalize comment tokens :)
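For illustration, a minimal sketch of the idea (toy string "tokens" and a made-up penalty value, not any real sampler's API): subtract a fixed penalty from comment-starting tokens' logits before the softmax, so they are effectively never sampled.

```python
import math

# Hypothetical vocabulary entries that would start a comment.
COMMENT_TOKENS = {"#", "//", "/*"}

def penalize_comments(logits, penalty=10.0):
    """Subtract a fixed penalty from comment-starting tokens before sampling."""
    return {t: (v - penalty if t in COMMENT_TOKENS else v)
            for t, v in logits.items()}

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

logits = {"return": 2.0, "#": 2.5, "x": 1.0}  # "#" would otherwise win
probs = softmax(penalize_comments(logits))
print(probs["#"] < 0.01)  # True: the comment token is effectively suppressed
```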

Interesting; my assumption used to be that models over-edit when they're run with optimizations in the attention blocks (quantization, Gated DeltaNet, sliding windows, etc.), i.e. they can't always reconstruct the original code precisely and may end up re-inventing some bits. Couldn't that be one of the reasons too?

From what I understand, ~30b is enough "intelligence" to make coding/reasoning etc. work, in general. Above ~30b, it's less about intelligence, and more about memorization. Larger models fail less and one-shot more often because they can memorize more APIs (documentation, examples, etc). Also from my experience, if a task is ambiguous, Sonnet has a better "intuition" of what my intent is. Probably also because of memorization, it has "access" to more repositories in its compressed knowledge to infer my intent more accurately.

>Latency, throughput, and routes don't matter here. When it's 10 seconds for the first token and then a 1KB/sec streamed response, whatever is fine. You can serve Australia from the US and it'll barely matter.

This may be true for simpler cases where you just stream responses from a single LLM in some kind of no-brain chatbot. If the pipeline is a bit more complex (multiple calls to different models, not only LLMs but also embedding models, rerankers, agentic stuff, etc.), latencies quickly add up. It also depends on the UI/UX expectations.
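A trivial back-of-the-envelope sum (hypothetical pipeline steps and an assumed 300ms round trip) shows how fast network latency alone accumulates:

```python
# Assumed numbers: ~300ms round trip to a distant region, and a hypothetical
# four-step pipeline where each step is a separate network call.
rtt_ms = 300
steps = ["rewrite query", "embed", "rerank", "generate"]

network_only_ms = rtt_ms * len(steps)
print(network_only_ms)  # 1200 ms of pure network time, before any inference
```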

Funny reading this, because the feature I developed can't go live for a few months in regions where we have to use Amazon Bedrock (for legal reasons), simply because Bedrock has very poor latency and stakeholders aren't satisfied with the final speed (users aren't expected to wait 10-15 seconds in that part of the UI; it would be awkward). And a single round trip to AWS Ireland from Asia is already at least ~300ms (multiply that by several calls in a pipeline and it adds up to seconds, just for the round trips), so having only one region is not an option.

Funny though, in one region we ended up buying our own GPUs and running the models ourselves. Response times there are about 3x faster on average for the same models than on Bedrock (and Bedrock often hangs for 20+ seconds for no reason, despite all the tricks AWS managers recommended, like cross-region inference and premium tiers). For me, it's been easier and less stressful to run LLMs/embedders/rerankers myself than to fight cloud providers' latencies :)

>then put all of your data centers there

>You definitely don't need a data center in every continent.

Not always possible, for legal reasons. Many jurisdictions already have (or plan to have) strict data-processing laws, and many B2B clients (and government clients too) require all data processing to stay in the country, or at least the region (like the EU), or we simply lose the deals. So, for example, we're already required to use data centers on at least 4 continents; just 2 more to go (if you don't count Antarctica :)

