Years ago I was making the case that instead of digging ourselves into the Amazon ecosystem with S3 storage, EC2 instances, DynamoDB and various other Amazon-specific cloud products... we should just host virtual machines and run everything inside them using open-source products.
People looked at me like they'd seen water burning, but that would have made the dependency on the US a lot easier to sever. Just move the VMs.
Feel a bit bad for Yomif Kejelcha, who also broke the 2-hour mark in his first competition marathon, yet managed to neither break a record nor win.
It is fundamental to language modeling that every sequence of tokens is possible. Murphy's Law, restated, is that every failure mode which is not prevented by a strong engineering control will happen eventually.
The sequence of tokens that would destroy your production environment can be produced by your agent, no matter how much prompting you use. That prompting is neither strong nor an engineering control; that's an administrative control. Agents are landmines that will destroy production until proven otherwise.
Most of these stories are caused by outright negligence, just giving the agent a high level of privileges. In this case they had a script with an embedded credential which was more privileged than they had believed - bad hygiene but an understandable mistake. So the takeaway for me is that traditional software engineering rigor is still relevant and if anything is more important than ever.
ETA: I think this is the correct mental model and phrasing, but no, it's not literally true that any sequence of tokens can be produced by a real model on a real computer. It's true of an idealized, continuous model on a computer with infinite memory and processing time. I stand by both the mental model and the phrasing, but obviously I'm causing some confusion, so I'm going to lift a comment I made deep in the thread up here for clarity:
> "Everything that can go wrong, will go wrong" isn't literally true either, some failure modes are mutually exclusive so at most one of them will go wrong. I think that the punchy phrasing and the mental model are both more useful from the standpoint of someone creating/managing agents and that it is true in the sense that any other mental model or rule of thumb is true. It's literally true among spherical cows in a frictionless vacuum and directionally correct in the real world with it's nuances. And most importantly adopting the mental model leads to better outcomes.
> If the job were mainly about producing syntactically valid code, then of course A.I. would be on a direct path to replacing large parts of the profession. But that was never the highest-value part of the work. The value was always in judgment.
> The valuable engineer is the one who sees the hidden constraint before it causes an outage. The one who notices that the team is solving the wrong problem. The one who reduces a vague debate into crisp tradeoffs. The one who identifies the missing abstraction. The one who can debug reality, not just read code. The one who can create clarity where everyone else sees noise
How do you think engineers in the second half got there? By writing tons and tons of code to "build those reps" and gain that experience.
The author tries to answer this:
> That process is not optional. It is how engineers acquire and elevate their competency. If early-career engineers use A.I. to remove all struggle from the learning loop, they are hurting their development.
but in a world wherein writing code by hand (the "struggle") is "artisanal" and "outdated", calling this process non-optional (which I agree it is) is contradictory.
How juniors and fresh grads build those reps with an AI designed to hand them whatever answer they need in the moment is unclear to me. I don't see how that's possible, but maybe I'm thinking too myopically.
> One defining constraint must shape the product... Minecraft is built entirely from blocks. IKEA is flat-pack, self-assembly furniture.
I've been calling these things product primitives. I can't remember where I heard that term, but it refers to things like...
Blocks in Notion. Messages and conversations in Telegram. Frames and layers in Figma. Tweets in Twitter. Cells and sheets in Excel. Tools and layers in Photoshop. Commands in a CLI.
I think what makes for good product design is having a very small number of primitives. A bad product doesn't know what its primitives are. Or it has a very large number of primitives. It feels like everything in the product is some unique thing that works in its own unique way. So users have to learn a ton of different top-level primitives/concepts. It's confusing and intimidating and hard to teach. Ideally you just want one or two or three main primitives.
The complexity/power in an app comes from choosing powerful primitives that have depth, that are composable, etc. You can do a lot with Notion blocks. You can do a lot with Excel cells. You can do a lot with a CLI command. You can do a lot with a Minecraft block. There's depth there.
"Your plan pricing is unchanged: Copilot Pro remains $10/month and Pro+ remains $39/month, and each includes $10 and $39 in monthly AI Credits, respectively."
If there's no discount on credits (in terms of tokens per dollar) over other providers, I'm going to switch to a PAYG provider. In a month with little to no coding I can pocket the $10. What incentive do they give me to stay on this plan?
1. SWE-bench Verified is now saturated at 93.9% (congrats Anthropic), but anyone who hasn't reached that number yet still has more room for growth.
2. SWE-bench Multilingual and SWE-bench Multimodal (which we'll open source in the next month) are still unsaturated.
3. All benchmarks and benchmark paradigms eventually become saturated. That's why the SWE-bench team has worked hard on building the next stage of benchmarks, and we have a few that are already out, for example https://codeclash.ai/ or https://algotune.io/. And we'll have more to say soon :)
Giving LLM agents direct, autonomous access to a real production database with write access seems insane to me.
NO ONE, agent or human, should have direct write access to production databases outside of emergency break-glass scenarios. This is why we have stored routines and API layers to pre-define what writes are allowed. The fact that agents CAN autonomously write to a database does not imply that they should.
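As a minimal sketch of that idea (hypothetical names, sqlite3 standing in for whatever database you actually run): the caller, agent or human, can only invoke writes that were defined ahead of time, never arbitrary SQL.

```python
import sqlite3

# Hypothetical whitelist of pre-defined, parameterized writes.
ALLOWED_WRITES = {
    "deactivate_user": "UPDATE users SET active = 0 WHERE id = ?",
    "record_payment":  "INSERT INTO payments (user_id, cents) VALUES (?, ?)",
}

def execute_write(conn: sqlite3.Connection, op: str, params: tuple) -> None:
    # Reject anything that wasn't defined ahead of time.
    if op not in ALLOWED_WRITES:
        raise PermissionError(f"write operation {op!r} is not allowed")
    with conn:  # commits on success, rolls back on error
        conn.execute(ALLOWED_WRITES[op], params)

# "DROP TABLE users" simply has no name in the whitelist, so it can't happen.
```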
For the point about query optimization: again, your agents should not be issuing random queries against a production database. We have had the concept of separate analytics databases, with different architectures to support exploratory queries, for decades.
For the record, it's failing silently, too, showing e.g. "There aren't any open pull requests." even though there are dozens. That's pretty bad; it will definitely mislead people.
I don't think this is a minor point. It seems clear by this point that the author is clueless about how the API even works and is just trying to shift blame onto third parties instead of admitting that they're vibecoding their whole product without doing proper checks.
Yes, sure, there seem to be lots of ways this issue could have been mitigated, but as other comments said, this mostly happened because the author didn't do their homework on how the service their whole product relies on actually works.
The irony is how difficult it is to read this obviously AI-generated article due to its unnatural prose and choppy flow full of LLM-isms. The ability to write is also a skill that atrophies.
Even when AI use is understandable, say for language fluency, I'd prefer to read an AI translation of the author's own writing over a generated article.
If you don’t care enough to write it, why should I care enough to read it?
The most aggravating fact here is not even the AI blunder. It's how deleting a volume in Railway also deletes the backups of it.
This was bound to happen, AI or not.
> Because Railway stores volume-level backups in the same volume — a fact buried in their own documentation that says "wiping a volume deletes all backups" — those went with it.
> ... macOS only ever programs CS42L84 to operate at either 48 or 96 kHz, we could only add support for those two sample rates to the Linux driver ...
> However, CS42L42 supports all the other common sample rates, and while the register layout and programming sequence is different, the actual values programmed in for 48 and 96 kHz are the same across both chips. What would happen if we simply took the values for all other sample rates from the CS42L42 datasheet and added those to the CS42L84 driver? As it turns out, you get support for those sample rates!
> The patch to enable hardware support for 44.1, 88.2, 176.4 and 192 kHz sample rates on both the input and output of the headphone jack was submitted directly upstream, and has been merged for 7.1. We also backported this to Asahi kernel 6.19.9, allowing users to take advantage of this immediately.
Nice bit of chip sleuthing and reverse engineering from the Asahi team!
I'm glad this article includes the only credible fix for the HTTP leak problems: CSP.
A useful thing I learned recently is that, while CSP policies are usually delivered via HTTP headers, you can also reliably set them directly in HTML, for example for HTML generated directly on a page where HTTP headers don't come into play:
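Something along these lines (the policy itself is illustrative):

```html
<meta http-equiv="Content-Security-Policy"
      content="default-src 'self'">
```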
It feels like this shouldn't work, because JavaScript in the untrusted content could use the DOM to delete or alter that meta tag... but it turns out all modern browsers specifically lock this down, treating those CSP rules as permanent as soon as the meta tag has loaded, before any malicious code has a chance to subvert them.
I think the tapping phones feature -- for initial friend creation, not upkeep -- is THE killer feature of the app.
Do I want my teens on any social media apps? No.
Would I let them be on Facebook of 2006, when you were just connected to your friends and family, and not influencers and "the algorithm?" Sure! That and early Instagram were great ways to keep up with real-life friends.
If you made this as easy and pleasant to scroll through as 2011 Instagram was, with only-real friends allowed, I might even return to social media myself. It would beat having to WhatsApp my family my vacation photos.
(And if this got big enough that celebrities were bumping phones with fans, heck, at least that's a more intentional connection than Insta forcing the latest wellness guru on my teen girl.)
If the API had replied "Are you sure (Y/N)?", the AI, in the mode it was in, guardrails completely pushed off the side of the road, would have just said "Yes" anyway.
If you needed to make two API calls, one to stage the delete and the other to execute it (i.e. the "commit" phase), the AI would have looked up what it needed to do, and done that instead.
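A sketch of the stage/commit pattern being described (hypothetical API, not any real provider's; and the point stands, since an agent will happily make both calls):

```python
import uuid

_pending: dict[str, str] = {}

def stage_delete(resource_id: str) -> str:
    """Phase 1: record the intent to delete and return a confirmation token."""
    token = str(uuid.uuid4())
    _pending[token] = resource_id
    return token

def commit_delete(token: str) -> None:
    """Phase 2: only a token from stage_delete() actually deletes anything."""
    resource_id = _pending.pop(token)  # KeyError if it was never staged
    print(f"deleting {resource_id}")   # stand-in for the real deletion
```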
4. Allows the model to execute code to analyze things on the fly, so it can simply write a bash/python/perl script to accomplish things where appropriate (see the sketch after this list)
5. A lot of context curation and opportunistic context updates, i.e. putting into context anything you are certain the model would ask for next
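For point 4, a minimal sketch of what such an execution tool can look like (the function name and sandboxing policy are my assumptions, not any particular product's API):

```python
import subprocess
import tempfile

def run_model_script(code: str, timeout_s: int = 10) -> str:
    """Hypothetical 'execute code' tool: run a model-written Python script
    in a separate process and hand the output back as fresh context.
    A real deployment would add sandboxing (container, no network, etc.)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        ["python3", path],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.stdout + result.stderr
```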
I often get in trouble on HN for being more sympathetic than most towards Apple. But that reasoning by Apple is ridiculous. They allow apps which only function if you buy a specific $100k+ EV, or some niche audiophile amp. Usefulness doesn’t get much more limited than that.
This is why I made Zork bench. Zork, the text adventure game, is in the training data for LLMs. It's also deterministic. Therefore it should be easy for LLMs to play and complete. Yet they don't. Understanding why is the goal of Zork bench.
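Not the actual Zork bench harness, but a sketch of the general loop shape, assuming a local Z-machine interpreter (dfrotz with a zork1.z5 game file) and a hypothetical ask_llm() call:

```python
import pexpect  # pip install pexpect

def ask_llm(transcript: str) -> str:
    """Hypothetical: hand the transcript so far to a model, get one command back."""
    raise NotImplementedError

# dfrotz and zork1.z5 are assumptions about the local setup.
child = pexpect.spawn("dfrotz zork1.z5", encoding="utf-8")
transcript = ""
for _ in range(400):               # cap the episode length
    child.expect(">")              # wait for the game's command prompt
    transcript += child.before + ">"
    command = ask_llm(transcript)  # e.g. "open mailbox"
    child.sendline(command)
```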
As someone who's maintained a meditation practice since 2013, this is definitely meditation.
And by "maintain a practice", I mean it's more like something I return to with frequency and less a daily compulsion.
Focusing on the breath or ambient sounds is "easy", and that is precisely why meditation is seemingly difficult. The mind craves more than simplicity; for some this occurs after a few seconds, for others after a few minutes... it all depends on the day. Learning to observe when the mind wanders is one part of the practice. Labelling the quality of thought that caused the wandering (planning, worrying, visualizing, replaying, etc.) and returning to the simpler act of focusing on breath or sounds is another part of the practice.
This article is very much the author discovering some variation of meditation; if they feel the need to "invent" something and share it in a blog post, then here's hoping it prompts more people to give it a shot, and maybe it'll lead to at least one person developing a new practice for themselves.
This agreement feels so friendly towards OpenAI that it's not obvious to me why Microsoft accepted this. I guess Microsoft just realized that the previous agreement was kneecapping OpenAI so much that the investment was at risk, especially with serious competition now coming from Anthropic?
Lots of us have noticed that usage limits for Claude have been nerfed in recent weeks/months.
If anything, these new multipliers are more transparent than anything OpenAI or Anthropic have communicated regarding actual costs and give us a more realistic understanding of what it's costing these providers.
The fact that we were able to get such a substantial amount of usage for $20/$100/$200 a month was never meant to last and to think otherwise was perhaps a bit naive.
This feels like a strategy from the ZIRP era of tech growth where companies burned investor capital and gave away their products and services for free (or subsidized them heavily) in order to prioritize user acquisition initially. Then once they'd gained enough traction and stickiness they'd then implement a monetization strategy to capitalize on said user base.
Likely an inside job. I had a similar experience with AWS where my account was compromised despite the fact that I had all the proper security features enabled. It was later discovered internal contractors were responsible. But up to that point AWS blamed the issue on me with no proof. A call to the AG office in my state got the ball rolling and initiated an investigation that finally got a manager to take the case seriously.
A wise man from Google said, in an internal memo, something to the tune of:
"We do not have any moat, and neither does anyone else."
Deepseek v4 is good enough, really really good given the price it is offered at.
PS: Just to be clear: even the most expensive AI models are unreliable and make stupid mistakes, and their code output MUST be reviewed carefully. Deepseek v4 is not any different; like all other models, Claude Opus included, it too is just a token generator sampling from learned token distributions, with no real thought process.
Author here. Wrote this after watching Lapsus$ post the Mercor archive on their leak site earlier this month. The thing that struck me is the combination: voice samples paired with ID document scans. Most breaches leak one or the other. This one ships a deepfake-ready kit. Tried to keep the writeup practical: what an attacker can actually do with this combo (banking voiceprint bypass, Arup-style video calls, insurance fraud), and a 5-step checklist for the contractors who were in the dump.
Happy to discuss the forensic detection side: AudioSeal watermarks, AASIST anti-spoofing, and how the detection landscape changes once voice biometrics start leaking at scale.
> 1. Democratization. We will resist the potential of this technology to consolidate power in the hands of the few.
For example, they could publish their models and research... instead of doing the opposite of what they claim is their very first principle.