More

m_ke · 2026-02-28T22:00:19 1772316019

It's more than that, supposedly Sama donated another 25mil through a PAC.

I'm sure the Crypto AI Czar (David Sacks) being a major Anthropic hater didn't hurt either

Or that Kushner put a billion in OpenAI recently

EDIT: wow they got in at a huge discount too and OpenAI bought stake in Thrive...

https://www.wsj.com/articles/thrive-capital-bought-shares-in...

https://openai.com/index/thrive-holdings/

Spooky23 · 2026-03-01T00:25:12 1772324712

POTUS pretty much told you this is what you are getting. His great admiration for Andrew Jackson pretty much says it all. Jackson was the poster child for bullshit populism, patronage and corruption.

m_ke · 2026-02-28T15:38:08 1772293088

You forgot the genocide they have going and the current attempt to starve Cuba into submission with their little "blockade"

reliabilityguy · 2026-02-28T15:55:00 1772294100

[flagged]

m_ke · 2026-02-28T16:06:21 1772294781

50% as in the barely armed males that were killed like fish in a barrel

https://www.thelancet.com/journals/langlo/article/PIIS2214-1...

> Women, children (ie, younger than 18 years), and older people (ie, older than 64 years) comprised 56·2% (95% CI 50·4–61·9) of violent deaths

parineum · 2026-02-28T16:21:19 1772295679

How many of those people under 18 are boys over 15?

Militias rarely have age restrictions.

autoexec · 2026-02-28T16:45:35 1772297135

Which still doesn't justify the slaughter or starvation of children

parineum · 2026-02-28T16:49:37 1772297377

I didn't say it did. I just think it's either naive or disengenuous to assume under 18 isn't a militant.

To a lesser extent, the same is true of women and elderly.

reliabilityguy · 2026-02-28T16:19:05 1772295545

I am not sure what is the point you are trying to make with these stats.

It is clear that if half of the killed were militants, the other half is not by definition.

50% of casualties being civilians does not mean it is a genocide.

m_ke · 2026-02-28T16:27:28 1772296048

if you slaughter civilians and label all males as combatants you conveniently get a near 50% militant death rate

don't play dumb, there's a reason israel is not letting foreign media into gaza and slaughtering local journalists at a rate never seen in history of war

reliabilityguy · 2026-02-28T16:31:56 1772296316

> if you slaughter civilians and label all males as combatants you conveniently get a near 50% militant death rate

Say the ratio is 1:4, then what?

> don't play dumb, there's a reason israel is not letting foreign media into gaza and slaughtering local journalists at a rate never seen in history of war

And, at the same time, they keep all the internet links alive so that Palestinians can show the whole world the "genocide"? Like, do you really think that Israelis are that dumb? Islamic Republic shut down the internet to hide the scope of butchery, but Israelis did not figure it out?

m_ke · 2026-02-28T16:52:22 1772297542

yes poor israel with it's nukes and iron dome is being oppressed by a bunch of women and children living in an open prison

now please tell me what you'd like to see happen with the remaining palestinians and what you expect to happen in the middle east after you destabilize another major country in the region

reliabilityguy · 2026-02-28T17:08:43 1772298523

> yes poor israel with it's nukes and iron dome is being oppressed by a bunch of women and children living in an open prison

This is "open prison": https://www.youtube.com/watch?v=jYCWjYBsr8M?

Truly oppressed people do not blow up themselves in cafes, busses, and schools. People in Iran are oppressed, their women are beaten for not covering their hair in the street, and yet, they do not blow up themselves.

m_ke · 2026-02-28T17:16:08 1772298968

answer my questions

also heres some nice footage of markets in warsaw ghetto for you https://youtu.be/a2a5qRkOqP4?si=twZ9zFYL3xh6Ms0h

as a Pole its sad to see so many jews get behind a fascist like Bibi. living in NYC i don't feel safer today and i don't see how the whole world turning on israel is good for jews long term

Trump shredding NATO and taking our random world leaders is also not making countries like Poland safer

reliabilityguy · 2026-02-28T17:31:31 1772299891

> answer my questions

Which ones?

You provided a 50:50 stats without any sort of reasoning or an argument. I asked what does it mean, and you completely ignored my question, but mentioned that Gaza is an open prison (which is not, as Palestinians can leave and come back, as many did pre-2023 war), and somehow said that if people are “oppressed”, it is okay for them to commit atrocities.

Now, I would expect that you as a Pole would be able to tell the difference between Warsaw ghetto and Gaza. I wonder why you choose this false equivalence: Jews did not attack Germany from Warsaw Ghetto, they did not launch rockets, kidnapped German civilians and kept them in captivity, jews could not leave.

> as a Pole its sad to see so many jews get behind a fascist like Bibi. living in NYC i don't feel safer today and i don't see how the whole world turning on israel is good for jews long term

And this is the fault of the jews, right? And not the people who make jews not safe?

hiddencost · 2026-02-28T16:03:26 1772294606

They're engaged in willful destruction of hospitals, they kill journalists on purpose, they have systematically blocked aid. Their friend minister recently declared an intent to eliminate all Palestinian territory.

You're just lying.

reliabilityguy · 2026-02-28T16:07:45 1772294865

> They're engaged in willful destruction of hospitals

If a civilian facility is used for military purposes it is a legitimate target. Ukranians also bomb schools and hospitals. Are Ukranians commit genocide?

If a hospital is never be attacked, what prevents militaries simply use hospitals as military bases? It's like the ultimate "get out of jail" free card.

> they kill journalists on purpose

US also did in Iraq. And? Does it make US's invasion of Iraq a genocide? Ukranians killed Russian journalists too. Does it make the war in Ukraine a genocide?

> they have systematically blocked aid

Egypt did so as well. Moreover, despite its international obligations, Egypt refused to accept Palestinian refugees as if it wanted a lot of civilians to die.

> Their friend minister recently declared an intent to eliminate all Palestinian territory.

You mean politicians pandering to their base?

> You're just lying.

Sure.

primroot · 2026-02-28T16:27:32 1772296052

Please provide sources. Genocide is not a matter of cherry-picking or of opinion. People who take this debate seriously look into context and evidence with a level of detail that goes beyond what can be covered here. Anyone interested in arguments and counterarguments will inevitable have to refer to authorities in the matter who have the background, time and resources.

autoexec · 2026-02-28T16:52:52 1772297572

Don't bother. He just effectively argued that there are no illegitimate targets in war because soldiers can be anywhere and that hospitals must be targeted or else they are "get out of jail free cards" whatever the fuck that means. War is war, but war crimes are still war crimes. No point trying to have rational discourse with someone advocating for war crimes.

reliabilityguy · 2026-02-28T17:11:09 1772298669

> He just effectively argued that there are no illegitimate targets in war

No, this is not what I've said.

> because soldiers can be anywhere and that hospitals must be targeted or else they are "get out of jail free cards" whatever the fuck that means.

The law is clear in this regard. If you use hospital for military purposes, it is a valid target.

> War is war, but war crimes are still war crimes.

When a hospital is used for military purposes and then attacked, it is not a war crime from the PoV of international law. You may not like it, but it is a fact.

> No point trying to have rational discourse with someone advocating for war crimes.

I think you are irrational here. Your reasoning is based on emotions, and not facts.

autoexec · 2026-02-28T17:23:10 1772299390

> The law is clear in this regard. If you use hospital for military purposes, it is a valid target.

This is wrong. Hospitals can only be valid targets if they are used to launch "acts harmful to the enemy". There are countless military purposes that still don't rise to that level. Sheltering soldiers, even using floors as war rooms for planning is not enough. Any response taken against a hospital must also be proportionate to the harm. Small arms fire from a hospital window does not justify bombing the entire building into rubble.

reliabilityguy · 2026-02-28T17:37:41 1772300261

> This is wrong.

No, it is not. Even hiding in the hospital make the hospital loose its protection (see here: https://lieber.westpoint.edu/legal-protection-hospitals-duri...)

This piece in particular:

> The ICRC’s Commentary cites as examples “firing at the enemy for reasons other than individual self-defence, installing a firing position in a medical post, the use of a hospital as a shelter for able-bodied combatants, as an arms or ammunition dump, or as a military observation post.” It also states that “transmitting information of military value” or being used “as a centre for liaison with fighting troops” results in loss of protection.

> Sheltering soldiers, even using floors as war rooms for planning is not enough.

It is enough for the hospital to loose its protection.

> Any response taken against a hospital must also be proportionate to the harm.

This is completely different question though: proportionality of response vs. protected status of various institutions and buildings at war.

reliabilityguy · 2026-02-28T16:42:26 1772296946

> Please provide sources.

Sources to what? Laws of war?

W.r.t. hospitals, you can read this article: https://lieber.westpoint.edu/legal-protection-hospitals-duri...

This piece in particular:

> The ICRC’s Commentary cites as examples “firing at the enemy for reasons other than individual self-defence, installing a firing position in a medical post, the use of a hospital as a shelter for able-bodied combatants, as an arms or ammunition dump, or as a military observation post.” It also states that “transmitting information of military value” or being used “as a centre for liaison with fighting troops” results in loss of protection.

So, given that Palestinians used schools consistently to hide weapons, are you saying that it never happens? It seems to me completely unreasonable to claim that Israelis destroyed "all the schools, hospitals, universities because they want genocide" very questionable given that Palestinians used civilian infrastructure and NGOs for its resistance in the past. If they did it, why won't they do it again?

Link: https://www.unrwa.org/newsroom/press-releases/unrwa-condemns...

> Genocide is not a matter of cherry-picking or of opinion.

Of course not. It is also not a a single %.

> People who take this debate seriously look into context and evidence with a level of detail that goes beyond what can be covered here. Anyone interested in arguments and counterarguments will inevitable have to refer to authorities in the matter who have the background, time and resources.

Absolutely. However, people here are using the term genocide as it is a settled matter. Moreover, their whole reasoning boils down to metrics that either show that any war is a genocide, or have no bearing at all.

Svoka · 2026-02-28T16:58:42 1772297922

Russian invasion of Ukraine is absolutely a genocidal war, with genocidal claims spoken out loud and actions documented, tens of thousands of times.

Never heard someone in USA claiming that Iraqis or Iranians had no right to exist, saying that they are not a real country and/or nation. This rhetoric is pretty much main stream in russia and used to justify ongoing genocide.

m_ke · 2026-02-28T04:21:15 1772252475

don't forget that Sama is a Thiel protege

m_ke · 2026-02-26T22:44:08 1772145848

It's going to get really ugly, Jason Lemkin called this out as a possibility a few hours ago: https://youtu.be/mBE_9vGJBUM?si=WSyZXYgV48WfrNrv&t=2908

We're about to see a lot of public SAAS companies do the same and rebrand as "AI" first

m_ke · 2026-02-24T21:06:22 1771967182

If OpenAI employees have an inch of spine left, they better demand Sama to take the same stance on this as Dario. No mass surveillance and no autonomous weapons.

blibble · 2026-02-24T21:09:21 1771967361

> If OpenAI employees have an inch of spine left

hahaha

good one

burnte · 2026-02-24T21:27:09 1771968429

Yep, that was resolved when he managed to make the board unfire him.

AIorNot · 2026-02-24T21:48:14 1771969694

Well just at look who Brockman donated too - he didnt give 25 freaking million to help end the surveillance state he gave it to Trump and co

https://gizmodo.com/openai-president-defends-trump-donations...

nova22033 · 2026-02-24T21:30:43 1771968643

yeah they should get up on stage and hold hands

https://fortune.com/2026/02/19/openai-anthropic-sam-altman-d...

mbac32768 · 2026-02-24T21:50:58 1771969858

You have to be a craven, hollowed out husk of a person if you let the DoD demand your AI be used for killing people or surveillance of Americans. Even if you believe America serves a positive role as world police, even if you're pro-Trump, you just have to see what a terrible precedent this sets.

Here's where I would expect the CEOs of the other AI labs to stand by Anthropic and say no.

Kim_Bruning · 2026-02-25T07:10:44 1772003444

Many of the OpenAI employees with an inch of spine already left - guess who founded Anthropic?

guelo · 2026-02-25T01:28:02 1771982882

Anthropic was founded when a bunch of OpenAI employees left after sama abandoned it's AI safety mission. So no, that's not going to happen.

outside1234 · 2026-02-24T21:47:11 1771969631

Sam would sell his mom to make $0.50. Pretty sure he will be willing to do whatever the Pentagon wants.

senordevnyc · 2026-02-25T04:43:14 1771994594

I always hear this view of Altman, but then why does he have no equity in OpenAI? What’s the greedy master plan there?

s1artibartfast · 2026-02-25T05:10:00 1771996200

He might get 5 to 10% in the restructuring that's underway. That would be 25 to 50 billion dollars

senordevnyc · 2026-02-28T02:47:15 1772246835

I thought that restructuring finished in the fall, and he still didn't get any equity?

m_ke · 2026-02-23T18:58:26 1771873106

we should probe anthropic for what accounts they made to access third party data, or which proxies they use to circumvent scraping blockers

m_ke · 2026-02-23T01:07:28 1771808848

1. Compete directly with their highest margin API customers

2. Buy up real businesses with the billions of equity and cash that they're raising

3. Lobby the government for regulations to stifle competition

4. Beg for a bailout or 0% loans when the music stops

5. Follow the Zuck playbook of copying competitors, spying on their users, spamming and addicting them, then squeezing the whales out of everything that they have (there's a reason why Anthropic and OpenAI have a bunch of Facebook execs leading their product groups)

m_ke · 2026-02-19T21:19:51 1771535991

It's the new underpaid employee that you're training to replace you.

People need to understand that we have the technology to train models to do anything that you can do on a computer, only thing that's missing is the data.

If you can record a human doing anything on a computer, we'll soon have a way to automate it

xyzzy123 · 2026-02-19T21:49:33 1771537773

Sure, but do you want abundance of software, or scarcity?

The price of having "star trek computers" is that people who work with computers have to adapt to the changes. Seems worth it?

worldsayshi · 2026-02-19T22:01:47 1771538507

My only objection here is that technology wont save us unless we also have a voice in how it is used. I don't think personal adaptation is enough for that. We need to adapt our ways to engage with power.

almostdeadguy · 2026-02-19T22:22:11 1771539731

Both abundance and scarcity can be bad. If you can't imagine a world where abundance of software is a very bad thing, I'd suggest you have a limited imagination?

krackers · 2026-02-20T01:41:05 1771551665

Abundance of services before abundance of physical resources seems like the worst of both worlds.

lanfeust6 · 2026-02-20T03:26:52 1771558012

Aggressively expanding solar would make electrical power a solved problem, and other previously non-abatable sources of kinetic energy are innovating to use this instead of fossil fuels

jimbokun · 2026-02-20T04:32:05 1771561925

It’s not worth it because we don’t have the Star Trek culture to go with it.

Given current political and business leadership across the world, we are headed to a dystopian hellscape and AI is speeding up the journey exponentially.

agumonkey · 2026-02-19T22:10:50 1771539050

It's a strange economical morbid dependency. AI companies promises incredible things but AI agents cannot produce it themselves, they need to eat you slowly first.

gtowey · 2026-02-19T23:05:13 1771542313

Perfect analogy for capitalism.

xnx · 2026-02-19T21:23:05 1771536185

Exactly. If there's any opportunity around AI it goes to those who have big troves of custom data (Google Workspace, Office 365, Adobe, Salesforce, etc.) or consultants adding data capture/surveillance of workers (especially high paid ones like engineers, doctors, lawyers).

mylifeandtimes · 2026-02-19T22:54:01 1771541641

> the new underpaid employee that you're training to replace you.

and who is also compiling a detailed log of your every action (and inaction) into a searchable data store -- which will certainly never, NEVER be used against you

badgersnake · 2026-02-19T22:02:46 1771538566

I think we’re past the “if only we had more training data” myth now. There are pretty obviously far more fundamental issues with LLMs than that.

m_ke · 2026-02-19T23:33:00 1771543980

i've been working in this field for a very long time, i promise you, if you can collect a dataset of a task you can train a model to repeat it.

the models do an amazing job interpolating and i actually think the lack of extrapolation is a feature that will allow us to have amazing tools and not as much risk of uncontrollable "AGI".

look at seedance 2.0, if a transformer can fit that, it can fit anything with enough data

Gigachad · 2026-02-19T21:59:23 1771538363

Data clearly isn't the only issue. LLMs have been trained on orders of magnitude more data than any person has ever seen.

polotics · 2026-02-19T21:37:07 1771537027

How much practice have you got on software development with agentic assistance. Which rough edges, surprising failure modes, unexpected strengths and weaknesses, have you already identified?

How much do you wish someone else had done your favorite SOTA LLM's RLHF?

cesarvarela · 2026-02-19T21:23:38 1771536218

LLMs have a large quantity of chess data and still can't play for shit.

dwohnitmok · 2026-02-19T21:43:27 1771537407

Not anymore. This benchmark is for LLM chess ability: https://github.com/lightnesscaster/Chess-LLM-Benchmark?tab=r.... LLMs are graded according to FIDE rules so e.g. two illegal moves in a game leads to an immediate loss.

This benchmark doesn't have the latest models from the last two months, but Gemini 3 (with no tools) is already at 1750 - 1800 FIDE, which is approximately probably around 1900 - 2000 USCF (about USCF expert level). This is enough to beat almost everyone at your local chess club.

runarberg · 2026-02-19T22:02:17 1771538537

Wait, I may be missing something here. These benchmarks are gathered by having models play each other, and the second illegal move forfeits the game. This seems like a flawed method as the models who are more prone to illegal moves are going to bump the ratings of the models who are less likely.

Additionally, how do we know the model isn’t benchmaxxed to eliminate illegal moves.

For example, here is the list of games by Gemini-3-pro-preview. In 44 games it preformed 3 illegal moves (if I counted correctly) but won 5 because opponent forfeits due to illegal moves.

https://chessbenchllm.onrender.com/games?page=5&model=gemini...

I suspect the ratings here may be significantly inflated due to a flaw in the methodology.

EDIT: I want to suggest a better methodology here (I am not gonna do it; I really really really don’t care about this technology). Have the LLMs play rated engines and rated humans, the first illegal move forfeits the game (same rules apply to humans).

dwohnitmok · 2026-02-20T01:15:18 1771550118

The LLMs do play rated engines (maia and eubos). They provide the baselines. Gemini e.g. consistently beats the different maia versions.

The rest is taken care of by elo. That is they then play each other as well, but it is not really possible for Gemini to have a higher elo than maia with such a small sample size (and such weak other LLMs).

Elo doesn't let you inflate your score by playing low ranked opponents if there are known baselines (rated engines) because the rated engines will promptly crush your elo.

You could add humans into the mix, the benchmark just gets expensive.

runarberg · 2026-02-20T17:32:49 1771608769

I did indeed miss something. I learned after posting (but before my EDIT) that there are anchor engines that they play.

However these benchmarks still have flaws. The two illegal moves = forfeit is an odd rule which the authors of the benchmarks (which in this case was Claude Code) added[1] for mysterious reasons. In competitive play if you play an illegal move you forfeit the game.

Second (and this is a minor one) Maia 1900 is currently rated at 1774 on lichess[2], but is 1816 on the leaderboard, to the author’s credit they do admit this in their methodology section.

Third, and this is a curiosity, gemini-3-pro-preview seems to have played the same game twice against Maia 1900[3][4] and in both cases Maia 1900 blundered (quite suspiciously might I add) mate in one when in a winning position with Qa3?? Another curiosity about this game. Gemini consistently played the top 2 moves on lichess. Until 16. ...O-O! (which has never been played on lichess) Gemini had played 14 most popular lichess moves, and 2 second most popular. That said I’m not gonna rule out that the fact that this game is listed twice might stem from an innocent data entry error.

And finally, apart from Gemini (and Survival bot for some reason?), LLMs seem unable to pass Maia-1100 (rated 1635 on lichess). The only anchor bot before that is random bot. And predictably LLMs cluster on both sides of it, meaning they play as well as random (apart from the illegal moves). This smells like benchmaxxing from Gemini. I would guess that the entire lichess repertoire features prominently in Gemini’s training data, and the model has memorized it really well. And is able to play extremely well if it only has to play 5-6 novel moves (especially when their opponent blunders checkmate in 1).

1: https://github.com/lightnesscaster/Chess-LLM-Benchmark/commi...

2: https://lichess.org/@/maia9

3: https://chessbenchllm.onrender.com/game/6574c5d6-c85a-4cb3-b...

4: https://chessbenchllm.onrender.com/game/4af82d60-8ef4-47d8-8...

dwohnitmok · 2026-02-20T17:59:05 1771610345

> The two illegal moves = forfeit is an odd rule which the authors of the benchmarks (which in this case was Claude Code) added[1] for mysterious reasons. In competitive play if you play an illegal move you forfeit the game.

This is not true. This is clearly spelled out in FIDE rules and is upheld at tournaments. First illegal move is a warning and reset. Second illegal move is forfeit. See here https://rcc.fide.com/article7/

I doubt GDM is benchmarkmaxxing on chess. Gemini is a weird model that acts very differently from other LLMs so it doesn't surprise me that it has a different capability profile.

runarberg · 2026-02-20T18:40:10 1771612810

>> 7.5.5 After the action taken under Article 7.5.1, 7.5.2, 7.5.3 or 7.5.4 for the first completed illegal move by a player, the arbiter shall give two minutes extra time to his/her opponent; for the second completed illegal move by the same player the arbiter shall declare the game lost by this player. However, the game is drawn if the position is such that the opponent cannot checkmate the player’s king by any possible series of legal moves.

I stand corrected.

I’ve never actually played competitive chess, I’ve just heard this from people who do. And I thought I remembered once in the Icelandic championships where a player touched one piece but moved the other, and subsequently made to forfeit the game.

runarberg · 2026-02-20T18:57:09 1771613829

Replying in a split thread to clearly separate where I was wrong.

If Gemini is so good at chess because of a non-LLM feature of the model, then it is kind of disingenuous to rate it as an LLM and claim that LLMs are approaching 2000 ELO. But the fact it still plays illegal moves sometimes, is biased towards popular moves, etc. makes me think that chess is still handled by an LLM, and makes me suspect benchmaxxing.

But even if no foul play, and Gemini is truly a capable chess player with nothing but an LLM underneath it, then all we can conclude is that Gemini can play chess well, and we cannot generalize to other LLMs who play about the level of random bot. My fourth point above was my strongest point. There are only 4 anchor engines, one beats all LLMs, second beats all except Gemini, the third beats all LLMs except Gemini and Survival bot (what is Survival bot even doing there?) and the forth is random bot.

dwohnitmok · 2026-02-21T06:21:52 1771654912

Gemini is an LLM. It playing chess is not relying on a non-LLM module of some sort. I'm just saying that as an LLM, Gemini has a peculiar profile compared to other LLMs (likely an artifact of its post-training process). In particular Gemini is very capable, but also quite misaligned (it will more often actively sabotage users).

> then all we can conclude is that Gemini can play chess well, and we cannot generalize to other LLMs who play about the level of random bot

That's overly reductive. That would be true if we didn't see improvement over time from the other LLMs but we clearly do. In particular, even if Gemini is benchmarkmaxxing, this means that LLMs from other labs will eventually get there as well. Benchmarkmaxxing can be thought of as "premature" reaching of benchmarks. But I can't think of a single benchmark that was benchmarkmaxxed that wasn't eventually saturated by every single LLM provider (because being able to benchmarkmaxx serves as an existence proof that there is an LLM capable of it and as more training gets done on the LLMs the other ones get there).

runarberg · 2026-02-21T15:10:59 1771686659

The problem with benchmaxxing is that lies about the capabilities of the technology. IF all we wanted was a machine that plays chess, we would just use a chess engine, which we have known how to make for decades. If Google wanted Gemini to be able to play chess, it would be much easier (and better; and hellavulat cheaper) to stick a traditional chess engine into their product and defer all chess to that engine.

The claim here (way up thread) was: “we have the technology to train models to do anything that you can do on a computer, only thing that's missing is the data”, and the implication is that logic and reasoning is an emerging properties of these models if given enough data and enough parameters. However the evidence seems to suggest otherwise. Logic and reasoning have to be specifically programmed into these models, and even with dataset as vast as online chess games (just lichess has 7.1 billion games), if that claim above were true, chess should be easy for LLMs, but it obviously isn’t. And that tells us something about the limitations of the technology.

emp17344 · 2026-02-19T22:22:33 1771539753

That’s a devastating benchmark design flaw. Sick of these bullshit benchmarks designed solely to hype AI. AI boosters turn around and use them as ammo, despite not understanding them.

famouswaffles · 2026-02-19T22:52:18 1771541538

Relax. Anyone who's genuinely interested in the question will see with a few searches that LLMs can play chess fine, although the post-trained models mostly seem to be regressed. Problem is people are more interested in validating their own assumptions than anything else.

https://arxiv.org/abs/2403.15498

https://arxiv.org/abs/2501.17186

https://github.com/adamkarvonen/chess_gpt_eval

runarberg · 2026-02-19T22:50:17 1771541417

I like this game between grok-4.1-fast and maia-1100 (engine, not LLM).

https://chessbenchllm.onrender.com/game/37d0d260-d63b-4e41-9...

This exact game has been played 60 thousand times on lichess. The peace sacrifice Grok performed on move 6 has been played 5 million times on lichess. Every single move Grok made is also the top played move on lichess.

This reminds me of Stefan Zweig’s The Royal Game where the protagonist survived Nazi torture by memorizing every game in a chess book his torturers dropped (excellent book btw. and I am aware I just committed Godwin’s law here; also aware of the irony here). The protagonist became “good” at chess, simply by memorizing a lot of games.

famouswaffles · 2026-02-19T22:57:03 1771541823

The LLMs that can play chess, i.e not make an illegal move every game do not play it simply by memorized plays.

dwohnitmok · 2026-02-20T01:16:22 1771550182

> That’s a devastating benchmark design flaw

I think parent simply missed until their later reply that the benchmark includes rated engines.

cesarvarela · 2026-02-19T22:02:05 1771538525

Yeah, but 1800 FIDE players don't make illegal moves, and Gemini does.

dwohnitmok · 2026-02-20T01:20:48 1771550448

1800 FIDE players do make illegal moves. I believe they make about one to two orders of magnitude less illegal moves than Gemini 3 does here. IIRC the usual statistic for expert chess play is about 0.02% of expert chess games have an illegal move (I can look that up later if there's interest to be sure), but that is only the ones that made it into the final game notation (and weren't e.g. corrected at the board by an opponent or arbiter). So that should be a lower bound (hence why it could be up to one order lower, although I suspect two orders is still probably closer to the truth).

Whether or not we'll see LLMs continue to get a lower error rate to make up for those orders of magnitude remains to be seen (I could see it go either way in the next two years based on the current rate of progress).

overgard · 2026-02-20T21:06:46 1771621606

I think LLM's are just fundamentally the wrong AI technique for games like this. You don't want a prediction for the next move, you want the best move given knowledge of how things would play out 18 moves ahead if both players played the optimal move. Outside of an academic interest/curiosity, there isn't really a reason to use LLMs for chess other than thinking LLMs will turn into AGI (I doubt it)

cesarvarela · 2026-02-20T03:41:40 1771558900

A player at that level making an illegal move is either tired, distracted, drunk, etc. An LLM makes it because it does not really "understand" the rules of chess.

runarberg · 2026-02-21T15:20:05 1771687205

I suspect the majority of these illegal moves are in blitz or bullet tournaments in game 12 of the third day, and the player touches one peace but moves another, or hits the clock with the hand that didn’t make the move, or hits the clock without making a move. I don‘t think any expert level chess player grabs a captured rook and places it on the board, or moves a light squared bishop to a dark square, unless they are hustling at the park, in which case (it can be argued) moves like this with a slight of hand is part of the game.

famouswaffles · 2026-02-19T22:51:29 1771541489

That benchmark methodology isn't great, but regardless, LLMs can be trained to play Chess with a 99.8% legal move rate.

recursive · 2026-02-19T23:43:42 1771544622

That doesn't exactly sound like strong chess play.

dwohnitmok · 2026-02-20T01:31:00 1771551060

It's enough to reliably beat amateur (e.g. maia-1900) chess engines.

deadbabe · 2026-02-19T21:57:28 1771538248

Why do we care about this? Chess AI have long been solved problems and LLMs are just an overly brute forced approach. They will never become very efficient chess players.

The correct solution is to have a conventional chess AI as a tool and use the LLM as a front end for humanized output. A software engineer who proposes just doing it all via raw LLM should be fired.

rodiger · 2026-02-19T22:01:14 1771538474

It's a proxy for generalized reasoning.

The point isn't that LLMs are the best AI architecture for chess.

runarberg · 2026-02-19T22:05:44 1771538744

> It's a proxy for generalized reasoning.

And so for I am only convinced that they have only succeeded on appearing to have generalized reasoning. That is, when an LLM plays chess they are performing Searle’s Chinese room thought experiment while claiming to pass the Turing test

deadbabe · 2026-02-20T00:10:25 1771546225

Why? Beating chess is more about searching a probability space, not reasoning.

Reasoning would be more like the car wash question.

famouswaffles · 2026-02-20T03:38:12 1771558692

It's not entirely clear how LLMs that can play chess do so, but it is clearly very different from the way other machines do so. The construct a board, they can estimate a players skill and adjust accordingly, and unlike other machines and similarly to humans, they are sensitive to how a certain position came to be when predicting the next move.

Regardless, there's plenty of reasoning in chess.

deadbabe · 2026-02-20T17:28:45 1771608525

It’s very clear how, chess moves and positions are vector encoded into their training data, when they are prompted with a certain board state, they respond with the most probable response to that. There is no reason.

famouswaffles · 2026-02-20T21:16:27 1771622187

Actual Researchers can't give you a complete answer but you can. Whatever you say.

deadbabe · 2026-02-21T19:13:51 1771701231

Guess what, I’m a researcher. I have published papers.

famouswaffles · 2026-02-23T04:46:41 1771822001

overgard · 2026-02-20T01:44:01 1771551841

They have literally every chess game in existence to train on, and they can't do better than 1800?

jimbokun · 2026-02-20T04:34:19 1771562059

Why do you think they won’t continue to improve?

overgard · 2026-02-20T20:29:44 1771619384

Because of how LLM's work. I don't know exactly how they're using it for chess, but here's a guess. If you consider the chess game a "conversation" between two opponents, the moves written out would be the context window. So you're asking the LLM, "given these last 30 moves, what's the most likely next move?". Ie, you're giving it a string like "1. e4 e5, 2. Nf3 Nc6, 3. Bb5 a6, 4..?".

That's basically what you're doing with LLMs in any context "Here's a set of tokens, what's the most likely continuation?". The problem is, that's the wrong question for a chess move. If you're going with "most likely continuation", that will work great for openings and well-studied move sequences (there are a lot of well studied move sequences!), however, once the game becomes "a brand new game", as chess streamers like to say when there's no longer a game in the database with that set of moves, then "what's the most likely continuation from this position?" is not the right question.

Non-LLM AI's have obviously solved chess, so, it doesn't really matter -- I think Chess shows how LLM's lack of a world model as Gary Marcus would say is a problem.

nicowesterdale · 2026-02-27T02:46:50 1772160410

I wrote a, I hope, amusing breakdown of the structural reasons why off-the-shelf Large Language Models physically cannot "see" a chess board, and continue to make illegal moves, and teleport pieces as seen in Gotham Chess' latest videos.

https://www.nicowesterdale.com/blog/why-llms-cant-play-chess

iugtmkbdfil834 · 2026-02-19T21:31:38 1771536698

Hm.. but do they need it.. at this point, we do have custom tools that beat humans. In a sense, all LLM need is a way to connect to that tool ( and the same is true is for counting and many other aspects ).

Windchaser · 2026-02-19T21:46:01 1771537561

Yeah, but you know that manually telling the LLM to operate other custom tools is not going to be a long-term solution. And if an LLM could design, create, and operate a separate model, and then return/translate its results to you, that would be huge, but it also seems far away.

But I'm ignorant here. Can anyone with a better background of SOTA ML tell me if this is being pursued, and if so, how far away it is? (And if not, what are the arguments against it, or what other approaches might deliver similar capacities?)

yunyu · 2026-02-19T22:27:05 1771540025

This has been happening for the past year on verifiable problems (did the change you made in your codebase work end-to-end, does this mathematical expression validate, did I win this chess match, etc...). The bulk of data, RL environment, and inference spend right now is on coding agents (or broadly speaking, tool use agents that can make their own tools).

Recent advances in mathematical/physics research have all been with coding agents making their own "tools" by writing programs: https://openai.com/index/new-result-theoretical-physics/

menaerus · 2026-02-19T22:05:45 1771538745

Did you already forget about the AlphaZero?

BeetleB · 2026-02-19T22:02:04 1771538524

Are you saying an LLM can't produce a chess engine that will easily beat you?

emp17344 · 2026-02-19T22:23:32 1771539812

Plagiarizing Stockfish doesn’t make me good at chess. Same principle applies.

m_ke · 2026-02-19T17:53:50 1771523630

this is mostly because RLVR is driving all of the recent gains, and you can continue improving the model by running it longer (+ adding new tasks / verifiers)

so we'll keep seeing more frequent flag planting checkpoint releases to not allow anyone to be able to claim SOTA for too long

m_ke · 2026-02-18T20:00:22 1771444822

also i wasn't concerned about open chinese models till the latest iteration of agentic models.

most open claw users have no idea how easy it is to add backdoors to these models and now they're getting free reign on your computer to do anything they want.

the risks were minimal with last generation of chat models, but now that they do tool calling and long horizon execution with little to no supervision it's going to become a real problem

8cvor6j844qw_d6 · 2026-02-18T20:09:04 1771445344

I went with an isolated Raspberry Pi and a separate chat account and network.

The only remaining risk is the API keys, but easily isolated.

Although I think having direct access on your primary PC may make it more useful, the potential risk is too much for my appetite.

oxag3n · 2026-02-18T20:22:46 1771446166

The only remaining risk? Considering wide range of bad actors and their intent, stealing your API keys is the last thing I'd worry about. People ended up in prison for things done on their computers, usually by them.

8cvor6j844qw_d6 · 2026-02-18T20:32:07 1771446727

Unless you're proposing never touching OpenClaw, how will you set it up to your satisfaction in terms of security?

> stealing your API keys is the last thing I'd worry about

I don't know, I very much prefer the API credits not being burned needlessly.

Now that I think of it, is there ever a case where an Anthrophic account is banned due to the related API keys being misused?

iugtmkbdfil834 · 2026-02-18T20:27:44 1771446464

This is genuinely the only way to do it now in a way that will not virtually guarantee some new and exciting ways to subvert your system. I briefly toyed with an idea of giving agent a vm playground, but I scrapped it after a while. I gave mine an old ( by today's standards ) pentium box and small local model to draw from, but, in truth, the only thing it really does is limit the amount of damage it can cause. The underlying issue remains in place.